Prometheus Anomaly Detection: Z-Score in PromQL

Monitoring HTTP request rates is one of the most basic yet essential tasks in observability. A sudden spike might indicate a traffic surge or even a DDoS attack, while a sudden drop could signal a backend failure. Static thresholds work, but they often miss subtle patterns or raise too many false alarms. A better way is to use statistical anomaly detection—specifically Z-score based alerts in Prometheus.

In this post, we’ll walk through how to set up a Z-score PromQL alert to detect anomalies in HTTP request rates using only Prometheus and native PromQL.

What You’ll Need

  • A running Prometheus instance with HTTP request metrics (e.g., from an NGINX Exporter or app instrumentation); the queries below assume a counter named http_requests_total
  • Some familiarity with PromQL
  • Optional: Alertmanager for alert routing

What is Z-score Anomaly Detection?

Z-score is a simple way to determine how far a data point is from the mean in terms of standard deviation.

The formula is:

Z = (x - μ) / σ

Where:

  • x is the current value
  • μ is the mean (average)
  • σ is the standard deviation

In plain terms: if your HTTP request rate is consistently around 100 requests per minute, and suddenly jumps to 200, Z-score will quantify how abnormal that is. If the score is above a threshold (commonly 3), we treat it as an anomaly.
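To make this concrete, here is the arithmetic for the jump described above. The standard deviation of 20 requests per minute is an assumed figure, chosen only to keep the numbers round.

Z = (200 - 100) / 20 = 5    (assuming σ = 20)

With these assumed numbers the score is 5, well above the common threshold of 3, so the jump would be flagged as an anomaly.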


Writing the PromQL Query

rate(http_requests_total[1m]) is the PromQL expression that gives the per-second request rate, averaged over the last minute (multiply by 60 for requests per minute).

(
  rate(http_requests_total[1m])
  - avg_over_time(rate(http_requests_total[1m])[15m:])
)
/
stddev_over_time(rate(http_requests_total[1m])[15m:])

This gives you a Z-score that is recomputed every time the expression is evaluated.
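Prometheus attaches labels such as job and instance to every scraped series, so the expression above is evaluated separately for each label combination of http_requests_total. If you would rather have a single score for the whole service, one option is to sum the rates before computing the statistics. This is a minimal sketch; it assumes that adding up all series is meaningful for your setup.

(
  sum(rate(http_requests_total[1m]))  # total request rate across all series (assumption: summing them makes sense)
  - avg_over_time(sum(rate(http_requests_total[1m]))[15m:])  # mean of that total over the last 15m
)
/
stddev_over_time(sum(rate(http_requests_total[1m]))[15m:])  # standard deviation of that total over the last 15m

To get one score per job instead of one overall, use sum by (job) (...) in place of sum(...).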

Adjusting the Window

The Z-score is based on a sliding-window average and standard deviation. That’s why it’s important to choose your window carefully. For example:

  • For an e-commerce site: 24h is a better fit
  • For a bank: 24h or 7d could be a better fit, because there is weekly seasonality (a pattern that repeats every week).
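Changing the window means changing the range in both subqueries of the query. The sketch below swaps the 15m used so far for 24h. The value is only an example; pick whatever matches the seasonality of your own traffic.

(
  rate(http_requests_total[1m])
  - avg_over_time(rate(http_requests_total[1m])[24h:])  # mean over the last 24h (example window, not a recommendation)
)
/
stddev_over_time(rate(http_requests_total[1m])[24h:])  # standard deviation over the same 24h

The part after the colon is the subquery resolution. Left empty, it defaults to the global evaluation interval, so a long window means many points per series. If that gets expensive, a coarser explicit step such as [24h:5m] reduces the work at the cost of a rougher baseline.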

Setting the Alert Rule

A Z-score of 3 or more is commonly treated as an anomaly. To cut down on false positives, I will set the threshold to 5.

groups:
  - name: http_anomaly_alert
    rules:
      - alert: HighHttpRequestAnomaly
        # Z-score of the current request rate against a 15m baseline
        expr: |
          (
            rate(http_requests_total[1m])
            - avg_over_time(rate(http_requests_total[1m])[15m:])
          )
          /
          stddev_over_time(rate(http_requests_total[1m])[15m:])
          > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HTTP request anomaly detected"
          description: "Z-score exceeds threshold: possible abnormal traffic."

The for: 2m means the condition must hold for 2 minutes before the alert fires, to reduce noise from short spikes.
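The rule above compares the Z-score with a positive threshold, so it only fires on spikes. A sudden drop gives a negative Z-score and never matches. If drops matter too, one option is to compare the absolute value. This sketch reuses the metric name from the post and the threshold of 5 chosen in the text; treat both as starting points.

abs(  # abs() makes negative scores (drops) positive
  (
    rate(http_requests_total[1m])
    - avg_over_time(rate(http_requests_total[1m])[15m:])
  )
  /
  stddev_over_time(rate(http_requests_total[1m])[15m:])
)
> 5  # assumed threshold, taken from the text; tune it for your traffic

To alert on drops only, keep the original expression without abs() and compare it with < -5 instead.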

Testing the Alert

To test, you can artificially increase traffic to your endpoint (e.g., with curl in a loop or a load testing tool like ab or wrk). Once the rate jumps beyond what’s statistically normal, the Z-score should rise and eventually trigger the alert.
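While the load test runs, you can follow the alert from the Prometheus expression browser. Prometheus keeps a synthetic ALERTS series for every pending or firing alert, so querying it by the alert name from the rule file shows where things stand:

ALERTS{alertname="HighHttpRequestAnomaly"}  # alert name as defined in the rule above

The alertstate label reads pending while the for: 2m timer is running and firing once it has elapsed. An empty result means the condition is not met at that moment.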

You can also visualize the Z-score in Grafana by pasting the expression into a panel and setting alert thresholds visually.

Conclusion

Using Z-score in PromQL gives you a simple but powerful way to detect anomalies in HTTP traffic without relying on fixed thresholds. It adapts to traffic patterns and helps catch unusual behavior as it happens.

Next steps you can try:

  • Tune the window size (e.g., 5m vs 15m) for more responsive or stable alerts.
  • Visualize the Z-score alongside request metrics in Grafana.
  • Use a similar approach for CPU usage, latency, or error rate anomalies.