Prometheus Anomaly Detection: Z-Score in PromQL
Monitoring HTTP request rates is one of the most basic yet essential tasks in observability. A sudden spike might indicate a traffic surge or even a DDoS attack, while a sudden drop could signal a backend failure. Static thresholds work, but they often miss subtle patterns or raise too many false alarms. A better way is to use statistical anomaly detection—specifically Z-score based alerts in Prometheus.
In this post, we’ll walk through how to set up a Z-score PromQL alert to detect anomalies in HTTP request rates using only Prometheus and native PromQL.
What You’ll Need
- A running Prometheus instance with HTTP request metrics (e.g., from an NGINX exporter, an ingress controller, or app instrumentation)
- Some familiarity with PromQL
- Optional: Alertmanager for alert routing
What is Z-score Anomaly Detection?
A Z-score is a simple way to express how far a data point is from the mean, measured in standard deviations.
The formula is:
Z = (x - μ) / σ
Where:
- x is the current value
- μ is the mean (average)
- σ is the standard deviation
In plain terms: if your HTTP request rate is consistently around 100 requests per minute and suddenly jumps to 200, the Z-score quantifies how abnormal that jump is. If the absolute value of the score exceeds a threshold (commonly 3), we treat the point as an anomaly.
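To make the arithmetic concrete, here is a minimal Python sketch that computes a Z-score by hand. The baseline samples are invented for illustration (rates hovering around 100 requests per minute), and `z_score` is a hypothetical helper, not something Prometheus provides.

```python
import statistics

def z_score(current, history):
    """Return how many standard deviations `current` lies from the mean of `history`."""
    mean = statistics.fmean(history)
    # Population standard deviation, i.e. the sigma in Z = (x - mu) / sigma.
    stddev = statistics.pstdev(history)
    if stddev == 0:
        # A perfectly flat history has no spread, so the score is undefined.
        return float("nan")
    return (current - mean) / stddev

# Invented baseline: request rates (per minute) hovering around 100.
baseline = [98, 102, 100, 97, 103, 100, 99, 101]

print(z_score(101, baseline))  # small score: within normal variation
print(z_score(200, baseline))  # very large score: far beyond a threshold of 3
```

With a baseline this tight, even a modest jump produces a large score, which is why the choice of history matters as much as the threshold.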
Writing the PromQL Query
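The PromQL query rate(http_requests_total[1m]) gives the per-second request rate, averaged over the last minute. To turn that rate into a Z-score, subtract its mean over the last 15 minutes and divide by its standard deviation over the same window: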
(
rate(http_requests_total[1m])
- avg_over_time(rate(http_requests_total[1m])[15m:])
)
/
stddev_over_time(rate(http_requests_total[1m])[15m:])
This gives you a Z-score that is recalculated on every evaluation, one value per time series.
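For readers who want to see what the query computes, the following Python sketch mirrors it on a plain list of samples. It is a simplified model, not how Prometheus evaluates the expression internally: `rolling_z_scores` is a hypothetical helper, the sample values are invented, and the sketch assumes the current sample is part of the window. The mean stands in for avg_over_time and the population standard deviation for stddev_over_time.

```python
from collections import deque
import statistics

def rolling_z_scores(samples, window):
    """Yield a Z-score for each sample against the last `window` samples (inclusive)."""
    recent = deque(maxlen=window)  # sliding window; the oldest sample drops out automatically
    for value in samples:
        recent.append(value)
        mean = statistics.fmean(recent)     # stands in for avg_over_time
        stddev = statistics.pstdev(recent)  # stands in for stddev_over_time (population)
        if stddev == 0:
            yield float("nan")              # flat window: the score is undefined
        else:
            yield (value - mean) / stddev

# Invented data: 29 steady samples around 10 req/s, then one spike to 25 req/s.
rates = [10 + 0.1 * ((i * 7) % 5 - 2) for i in range(29)] + [25.0]
scores = list(rolling_z_scores(rates, window=30))
print(round(scores[-1], 2))  # Z-score of the spike
```

One property worth knowing: when the current sample is itself part of the window, a single outlier among n samples can never score above sqrt(n - 1). A window that holds only a handful of samples therefore caps the achievable Z-score, which is worth keeping in mind when picking both the window and the threshold.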
Adjusting the Window
The Z-score is computed against a sliding-window mean and standard deviation, so it’s important to choose the window carefully. For example:
- For an e-commerce site: a 24h window is a better fit.
- For a bank: 24h or 7d could be a better fit, because traffic has a weekly seasonality (a pattern that repeats every week).
Setting the Alert Rule
By convention, a Z-score above 3 is treated as an anomaly. To avoid false-positive alerts, I will set the threshold to 5, which reduces the noise.
groups:
  - name: http_anomaly_alert
    rules:
      - alert: HighHttpRequestAnomaly
        # Z-score of the request rate against its 15-minute mean and standard deviation
        expr: |
          (
            rate(http_requests_total[1m])
            - avg_over_time(rate(http_requests_total[1m])[15m:])
          )
          /
          stddev_over_time(rate(http_requests_total[1m])[15m:])
          > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "HTTP request anomaly detected"
          description: "Z-score exceeds threshold: possible abnormal traffic."
The for: 2m clause means the condition must hold for 2 minutes before the alert fires, which reduces noise from short spikes.
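The effect of the for clause can be sketched in Python as a small state machine. This is a simplified model under stated assumptions: evaluations are taken to be 60 seconds apart, the Z-scores are invented, and `alert_states` is a hypothetical helper. The alert is pending from the first evaluation where the condition is true and fires once the condition has held for the full duration.

```python
def alert_states(z_scores, threshold, hold_seconds, eval_interval_seconds):
    """Model 'inactive' / 'pending' / 'firing' for a sequence of rule evaluations."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, z in enumerate(z_scores):
        now = i * eval_interval_seconds
        if z > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states

# One short spike, then a sustained one (invented values).
scores = [0.4, 6.1, 0.7, 5.5, 6.0, 6.3, 5.8]
print(alert_states(scores, threshold=5, hold_seconds=120, eval_interval_seconds=60))
```

In this model the isolated spike only ever reaches the pending state, while the sustained one fires on its third consecutive evaluation.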
Testing the Alert
To test, you can artificially increase traffic to your endpoint (e.g., with curl in a loop or a load testing tool like ab or wrk). Once the rate jumps beyond what’s statistically normal, the Z-score should rise and eventually trigger the alert.
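If you would rather script the burst, a minimal Python equivalent of the curl loop might look like the sketch below. The helper name `send_burst` and the URL in the usage comment are assumptions for illustration; point it only at a service you own. The `fetch` parameter exists so the function can be exercised without a network.

```python
import urllib.error
import urllib.request

def send_burst(url, count, fetch=urllib.request.urlopen):
    """Send `count` sequential GET requests and return how many succeeded."""
    succeeded = 0
    for _ in range(count):
        try:
            with fetch(url, timeout=2) as response:
                response.read()
                succeeded += 1
        except (urllib.error.URLError, OSError):
            pass  # failed requests are simply not counted
    return succeeded

# Example usage (hypothetical local endpoint):
#   send_burst("http://localhost:8080/", count=500)
```

A sequential loop like this produces a modest burst. For heavier load, the dedicated tools mentioned above are a better fit.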
You can also visualize the Z-score in Grafana by pasting the expression into a panel and setting alert thresholds visually.
Conclusion
Using Z-score in PromQL gives you a simple but powerful way to detect anomalies in HTTP traffic without relying on fixed thresholds. It adapts to traffic patterns and helps catch unusual behavior as it happens.
Next steps you can try:
- Tune the window size (e.g., 5m vs 15m) for more responsive or stable alerts.
- Visualize the Z-score alongside request metrics in Grafana.
- Use a similar approach for CPU usage, latency, or error rate anomalies.