#prometheus #victoriametrics

High-Cardinality Metrics: Detection and Optimization in Prometheus and VictoriaMetrics

I remember the first time my Prometheus instance crashed spectacularly after I added a new exporter. The logs screamed about “out of memory” errors, and my Grafana dashboards turned into ghost towns. After some frantic debugging, I discovered the culprit: high-cardinality metrics.

In this guide, I’ll share practical techniques I’ve learned for identifying and optimizing these metric monsters in both Prometheus and VictoriaMetrics.

Why High-Cardinality Metrics Matter

High-cardinality metrics occur when a metric has too many unique label combinations. Common offenders include:

  • HTTP request metrics with full URLs as labels
  • User-specific metrics with user IDs
  • Container metrics with randomly generated pod names

These can cause:

  1. Memory explosions in Prometheus
  2. Slow queries in both Prometheus and VictoriaMetrics
  3. Storage bloat from excessive time series

Tools You’ll Need

  • A running Prometheus instance (installation guide)
  • VictoriaMetrics (installation docs)
  • PromQL and MetricsQL knowledge
  • promtool (comes with Prometheus)
  • A terminal with curl and jq installed

Step 1: Identifying High-Cardinality Metrics

Using Prometheus UI

  1. Run this query to find metrics with the most series:
topk(10, count by (__name__)({__name__=~".+"}))
  2. For a specific metric, check its cardinality:
count(count by (le, method, path, status)(http_request_duration_seconds_bucket))
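
If you prefer the command line, Prometheus also exposes head-block cardinality statistics via its TSDB status API. A minimal sketch, assuming Prometheus listens on localhost:9090:

# Top metric names by series count in the current head block
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'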

Using VictoriaMetrics

VictoriaMetrics provides special tools for this:

# Get the total number of time series stored in the cluster
curl http://vmselect:8481/select/0/prometheus/api/v1/series/count | jq
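
To see which metric names contribute the most series, VictoriaMetrics also implements the same TSDB status endpoint as Prometheus. A rough sketch against the same vmselect address (for single-node VictoriaMetrics, drop the /select/0 prefix and use port 8428):

# Top 10 metric names by series count
curl -s 'http://vmselect:8481/select/0/prometheus/api/v1/status/tsdb?topN=10' | jq '.data.seriesCountByMetricName'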

Using Promtool

# Analyze the on-disk TSDB; the report includes "Highest cardinality" sections
promtool tsdb analyze /path/to/prometheus/data | grep -A10 "Highest cardinality"

Step 2: Optimizing High-Cardinality Metrics

Strategy 1: Reduce Label Cardinality

Instead of:

- name: http_requests_total
  labels:
    user_id: "12345"
    path: "/api/v1/users/12345/profile"

Use:

- name: http_requests_total
  labels:
    user_type: "registered" # Instead of user_id
    path_pattern: "/api/v1/users/:id/profile" # Parameterized path
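
If you can't change the exporter itself, the same reduction can be applied at scrape time with Prometheus metric relabeling. A minimal sketch, assuming a hypothetical job named my-app and a raw path label called path:

scrape_configs:
  - job_name: my-app                     # hypothetical job name
    static_configs:
      - targets: ["my-app:8080"]         # placeholder target
    metric_relabel_configs:
      # Drop the user_id label before samples are stored
      - action: labeldrop
        regex: user_id
      # Collapse concrete user paths into one parameterized pattern
      - source_labels: [path]
        regex: "/api/v1/users/[0-9]+/profile"
        target_label: path
        replacement: "/api/v1/users/:id/profile"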

Strategy 2: Use Recording Rules

Create aggregation rules in prometheus.rules.yml:

groups:
  - name: http_aggregated
    rules:
      - record: http_request_duration_seconds_bucket:rate5m
        # Keep the le label so histogram_quantile() still works on the aggregated series
        expr: |
          sum by (service, status_code, method, le) (
            rate(http_request_duration_seconds_bucket[5m])
          )
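
Recording rules only take effect once the file is referenced from prometheus.yml and the config is reloaded. A quick sketch, assuming the rules file sits next to the main config:

# In prometheus.yml
rule_files:
  - prometheus.rules.yml

# Validate the rules file before reloading Prometheus
promtool check rules prometheus.rules.yml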

Strategy 3: VictoriaMetrics Specific Optimizations

VictoriaMetrics offers several helpful features:

  1. Deduplication:
# Keep at most one raw sample per 30s per series (set on vmstorage and vmselect, or on single-node VictoriaMetrics)
-dedup.minScrapeInterval=30s
  2. Limiting series creation:
-storage.maxHourlySeries=1000000 -storage.maxDailySeries=10000000
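
Put together, a single-node launch could look like the sketch below; the binary path and limit values are placeholders to tune for your workload:

/path/to/victoria-metrics-prod \
  -dedup.minScrapeInterval=30s \
  -storage.maxHourlySeries=1000000 \
  -storage.maxDailySeries=10000000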

Step 3: Monitoring Cardinality

Create a dashboard panel with these queries:

Prometheus:

sum by (job) (scrape_series_added)

VictoriaMetrics:

vm_metrics_with_highest_cardinality
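
To get warned before things fall over, you can also alert on the active series count. A minimal sketch, with the 2,000,000 threshold as an arbitrary placeholder to tune for your instance size:

groups:
  - name: cardinality_alerts
    rules:
      - alert: TooManyActiveSeries
        # prometheus_tsdb_head_series = active series in the head block
        expr: prometheus_tsdb_head_series > 2000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus is tracking {{ $value }} active series"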

Troubleshooting Common Issues

Problem: Prometheus crashes with “out of memory” errors
Solution:

  1. Increase the scrape interval for high-cardinality jobs (see the sketch below)
  2. Set --storage.tsdb.retention.time=7d to limit storage impact
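
For the first point, a per-job override is enough. A minimal sketch, with kube-state-metrics as a hypothetical high-cardinality job:

scrape_configs:
  - job_name: kube-state-metrics       # hypothetical high-cardinality job
    scrape_interval: 60s               # scrape less often than the global default
    static_configs:
      - targets: ["kube-state-metrics:8080"]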

Problem: Queries timeout
Solution:

  1. Use more specific time ranges
  2. Create pre-aggregated recording rules

Going Further

For additional optimizations:

  1. Consider Prometheus relabeling to drop unnecessary labels
  2. Explore VictoriaMetrics' downsampling
  3. Implement per-scrape cardinality limits (sample_limit, label_limit) in recent Prometheus versions, as sketched below
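
A rough sketch of such scrape-time limits, with my-app and the numeric values as placeholders (exceeding sample_limit or label_limit fails the whole scrape, so leave headroom):

scrape_configs:
  - job_name: my-app                 # hypothetical job name
    sample_limit: 50000              # fail the scrape if it returns more samples than this
    label_limit: 30                  # fail the scrape if any series carries more than 30 labels
    label_value_length_limit: 200    # fail the scrape on overly long label values
    static_configs:
      - targets: ["my-app:8080"]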

FAQ

Q: How many labels is too many?
A: There’s no hard rule; what matters is how many series a metric produces, and a single metric with more than ~10,000 unique series often causes problems.

Q: Should I use VictoriaMetrics instead of Prometheus?
A: VictoriaMetrics handles high-cardinality better, but both benefit from optimization.

Q: Can I fix this without modifying my exporters?
A: Yes! Use metric relabeling in your Prometheus config to drop or rewrite high-cardinality labels.

For more on monitoring fundamentals, check out my guide on Grafana alerting with custom messages.