High-Cardinality Metrics: Detection and Optimization in Prometheus and VictoriaMetrics
I remember the first time my Prometheus instance crashed spectacularly after I added a new exporter. The logs screamed about “out of memory” errors, and my Grafana dashboards turned into ghost towns. After some frantic debugging, I discovered the culprit: high-cardinality metrics.
In this guide, I’ll share practical techniques I’ve learned for identifying and optimizing these metric monsters in both Prometheus and VictoriaMetrics.
Why High-Cardinality Metrics Matter
High-cardinality metrics occur when a metric has too many unique label combinations. Common offenders include:
- HTTP request metrics with full URLs as labels
- User-specific metrics with user IDs
- Container metrics with randomly generated pod names
These can cause:
- Memory explosions in Prometheus
- Slow queries in both Prometheus and VictoriaMetrics
- Storage bloat from excessive time series
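To see how fast this compounds, take a hypothetical http_requests_total with 5 methods, 6 status codes, 50 endpoints and a user_id label covering 10,000 users: that is up to 5 × 6 × 50 × 10,000 = 15,000,000 potential time series from a single metric name.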
Tools You’ll Need
- A running Prometheus instance (installation guide)
- VictoriaMetrics (installation docs)
- PromQL and MetricsQL knowledge
- promtool (comes with Prometheus)
- A terminal with curl and jq installed
Step 1: Identifying High-Cardinality Metrics
Using Prometheus UI
- Run this query to find metrics with the most series:
topk(10, count by (__name__)({__name__=~".+"}))
- For a specific metric, check its cardinality:
count(count by (le, method, path, status)(http_request_duration_seconds_bucket))
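If you prefer the HTTP API, Prometheus also exposes TSDB head statistics; a quick sketch, assuming Prometheus listens on localhost:9090:
# Top metric names by series count in the TSDB head
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'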
Using VictoriaMetrics
VictoriaMetrics provides special tools for this. A quick sanity check is the total series count:
# Total number of time series (cluster version, queried via vmselect)
curl http://vmselect:8481/select/0/prometheus/api/v1/series/count | jq
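VictoriaMetrics also serves the Prometheus-compatible TSDB status endpoint, which is handier for finding the worst offenders; this sketch assumes the same cluster layout with vmselect on port 8481:
# Per-metric cardinality statistics
curl -s http://vmselect:8481/select/0/prometheus/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'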
Using Promtool
promtool tsdb analyze /path/to/prometheus/data | grep -A10 "Highest cardinality"
Step 2: Optimizing High-Cardinality Metrics
Strategy 1: Reduce Label Cardinality
Instead of:
- name: http_requests_total
  labels:
    user_id: "12345"
    path: "/api/v1/users/12345/profile"
Use:
- name: http_requests_total
  labels:
    user_type: "registered"                    # Instead of user_id
    path_pattern: "/api/v1/users/:id/profile"  # Parameterized path
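If you can't change the exporter itself, the same reduction can be approximated with Prometheus metric relabeling. A minimal sketch, where the job name, target and regex are assumptions to adapt:
scrape_configs:
  - job_name: api                        # hypothetical job name
    static_configs:
      - targets: ['api.example.com:8080']
    metric_relabel_configs:
      # Drop the user_id label entirely before ingestion
      - action: labeldrop
        regex: user_id
      # Collapse numeric user IDs in the path into a placeholder
      - source_labels: [path]
        regex: '(/api/v1/users/)\d+(/.*)?'
        target_label: path
        replacement: '${1}:id${2}'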
Strategy 2: Use Recording Rules
Create aggregation rules in prometheus.rules.yml:
groups:
  - name: http_aggregated
    rules:
      - record: http_request_duration_seconds:rate5m
        expr: |
          sum by (service, status_code, method) (
            rate(http_request_duration_seconds_bucket[5m])
          )
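For the rules to take effect, reference the file from your main config; a minimal sketch, assuming both files sit in the same directory:
# prometheus.yml
rule_files:
  - prometheus.rules.yml
Then reload Prometheus (SIGHUP, or a POST to /-/reload if --web.enable-lifecycle is set) and point dashboards at the recorded series instead of the raw buckets.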
Strategy 3: VictoriaMetrics Specific Optimizations
VictoriaMetrics offers several helpful features:
- Deduplication (keeps one sample per configured interval):
# In vmstorage and vmselect args
-dedup.minScrapeInterval=30s
- Limiting series creation:
-storage.maxHourlySeries=1000000 -storage.maxDailySeries=10000000
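Putting those flags together for the single-node binary looks roughly like this; the values are assumptions to tune for your workload:
# Single-node VictoriaMetrics with dedup and series limits
./victoria-metrics-prod \
  -dedup.minScrapeInterval=30s \
  -storage.maxHourlySeries=1000000 \
  -storage.maxDailySeries=10000000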
Step 3: Monitoring Cardinality
Create a dashboard panel with these queries:
Prometheus:
sum(scrape_series_added) by (job)
VictoriaMetrics (approximate number of active time series; recent vmui versions also include a cardinality explorer):
vm_cache_entries{type="storage/hour_metric_ids"}
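To get warned before things fall over, you can also alert on series churn. A sketch of a Prometheus alerting rule, where the threshold and names are assumptions:
groups:
  - name: cardinality_alerts
    rules:
      - alert: HighSeriesChurn
        expr: sum(scrape_series_added) by (job) > 10000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Job {{ $labels.job }} keeps adding new time series'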
Troubleshooting Common Issues
Problem: Prometheus crashes with “out of memory” errors
Solution:
- Increase the scrape interval for high-cardinality jobs so they're scraped less often
- Set --storage.tsdb.retention.time=7d to limit storage impact
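A per-scrape safety net also helps: with sample_limit, Prometheus fails the entire scrape when a target exceeds the limit, so one misbehaving exporter can't take the server down. Job name and limit below are assumptions:
scrape_configs:
  - job_name: noisy-exporter
    sample_limit: 50000              # the whole scrape is dropped if exceeded
    static_configs:
      - targets: ['exporter.example.com:9100']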
Problem: Queries timeout
Solution:
- Use more specific time ranges
- Create pre-aggregated recording rules
Going Further
For additional optimizations:
- Consider Prometheus relabeling to drop unnecessary labels
- Explore VictoriaMetrics' downsampling
- Implement cardinality limits in Prometheus 2.30+
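The per-scrape limits mentioned in the last point can be sketched like this; the values are assumptions, and a scrape that exceeds them is marked as failed:
scrape_configs:
  - job_name: api                    # hypothetical job
    label_limit: 30                  # max labels per scraped series
    label_value_length_limit: 200    # max length of any label value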
FAQ
Q: How many labels is too many?
A: There’s no hard rule, but metrics with >10,000 unique series often cause problems.
Q: Should I use VictoriaMetrics instead of Prometheus?
A: VictoriaMetrics handles high-cardinality better, but both benefit from optimization.
Q: Can I fix this without modifying my exporters?
A: Yes! Use metric relabeling in your Prometheus config to drop or hash high-cardinality labels.
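Hashing can be sketched with the hashmod relabel action, which maps an unbounded label onto a fixed number of buckets; the label names and modulus here are assumptions:
metric_relabel_configs:
  # Map user_id onto 100 stable buckets, then drop the raw label
  - source_labels: [user_id]
    action: hashmod
    modulus: 100
    target_label: user_bucket
  - action: labeldrop
    regex: user_id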
For more on monitoring fundamentals, check out my guide on Grafana alerting with custom messages.