Detect Outliers in Your Metrics: A Practical Guide to Grafana Machine Learning
I’ve always been the kind of person who stares at a Grafana dashboard, watching a squiggly line and wondering, “Is that dip normal? Should I be worried?” For years, my answer was to set up a basic threshold alert. If CPU usage goes above 90%, page someone. But what about the weird, subtle stuff? The slow creep of a memory leak, or the sudden, inexplicable drop in request rate that doesn’t cross any static line but just feels… off?
That’s where anomaly detection comes in. Instead of you defining the rules, you can let machine learning models learn the normal behavior of your systems and flag the outliers for you. And the best part? You don’t need a PhD in data science to set it up. Grafana’s built-in Machine Learning plugin brings powerful, unsupervised anomaly detection right into the dashboard you already know and love.
In this guide, I’ll walk you through exactly how I set it up to monitor my own home lab infrastructure, using data from Prometheus. We’ll turn those gut feelings into concrete, actionable alerts.
What You’ll Need
Before we start, let’s make sure you have everything. You don’t need much, which is the beauty of it.
- A Running Grafana Instance (v9.5.0 or later): The ML features are enterprise-grade but are available for free in the Grafana Cloud Free plan and the generous Grafana Enterprise Trial. I’m using my self-hosted Grafana instance with an active trial license.
- A Data Source with Metrics: We need data to analyze! This tutorial will work with:
- Prometheus: The CNCF time series darling. If you need help setting it up, I’ve written about monitoring Home Assistant with Prometheus before.
- VictoriaMetrics: A powerful, open-source alternative that is Prometheus-compatible.
- The Grafana Machine Learning Plugin: This is installed by default in Grafana v9.5.0 and above. You just need to make sure it’s enabled.
Step 1: Enabling and Accessing the Machine Learning Features
First, log into your Grafana instance. The ML features are baked right into the core, but you need to ensure the plugin is enabled.
- Navigate to Administration > Plugins.
- Search for “Machine Learning”.
- Ensure the plugin is enabled. It should be by default.
If you’re on a fresh install, you’re good to go. The main entry point is the “Machine Learning” tab in the main menu. Click it.
Heads up! If you don't see this tab, you might need to check your license or user permissions. You need the Grafana Enterprise Trial or a Cloud account to access these features. The trial is automatic and free for self-hosted instances.
Step 2: Your First Anomaly Detection Job
The plugin works by creating “jobs” that run in the background, training models on your selected metrics and then scoring new data points for how anomalous they are. Let’s create a simple one.
- In the ML menu, click Create job > Metric job.
- You’ll be presented with a screen to configure your job. It looks complex, but we’ll break it down.
Picking the Right Metric
This is the most important step. Not all metrics are created equal for anomaly detection. You want something that has a generally predictable pattern but occasional, meaningful deviations.
- Great candidates: CPU usage, request latency, traffic rates, memory usage of a stable service.
- Poor candidates: Constantly spiky metrics, metrics that only change on rare events (like `node_boot_time_seconds`), raw counters that only ever increase (take a `rate()` of those first), or metrics with no historical data.
For this example, let’s use a classic: `node_memory_MemAvailable_bytes`. The available memory on a machine should be relatively stable, and a sudden drop or rise could indicate a problem.
In the Metric field, start typing `node_memory_MemAvailable_bytes` and select it from your data source.
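Before creating the job, it’s worth sanity-checking the candidate metric in Explore. Here’s a minimal sketch of what I run, assuming standard node_exporter metric names and an `$instance` dashboard variable (swap in a literal hostname if you don’t use one); run the expressions one at a time:

```promql
# The raw signal the ML job will learn from
node_memory_MemAvailable_bytes{instance="$instance"}

# Optional: the same signal as a percentage of total memory, which is often
# easier to eyeball for what "normal" looks like
100 * node_memory_MemAvailable_bytes{instance="$instance"}
  / node_memory_MemTotal_bytes{instance="$instance"}
```

If the history looks reasonably steady over at least the training window, it’s a good candidate.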
Configuring the Job Settings
The default settings are a great starting point. Here’s what they mean:
- Training Window: The period of historical data the model will use to learn “normal” behavior. The default of `7d` (7 days) is perfect, since it learns weekly seasonality.
- Interval: How often the job runs to calculate new anomaly scores. `5m` means it will process the last 5 minutes of data every 5 minutes.
- Hyperparameters: You can mostly leave these on “Auto” for now. The `tolerance` parameter controls how sensitive the model is. A lower value (e.g., `0.1`) makes it more sensitive, flagging more things as anomalous. A higher value (e.g., `0.8`) makes it more conservative. Start with `0.5`.
Give your job a clear name, like “Memory Available Anomaly Detection”.
Click Create, and your job is now live! It will first backfill anomaly scores for the training window and then continue evaluating new data.
Step 3: Visualizing Anomalies in a Dashboard
The job is running, but how do we see the results? The ML plugin exposes a new metric called `__ml_analysis_anomaly_score`. A score of `0` means “perfectly normal”; a score approaching `1` means “highly anomalous”.
Let’s add this to an existing dashboard or create a new one.
Create a new panel and set its query to your original metric: `node_memory_MemAvailable_bytes{instance="$instance"}`. Style this as a nice blue line.
Now, add a second query to overlay the anomaly score. The magic happens here:
`__ml_analysis_anomaly_score{job="Memory Available Anomaly Detection"} > 0`
This query only returns data points when the anomaly score is above 0, i.e., whenever the model detects an anomaly.
Change the visualization for this second query. In the panel editor, add a field override for it (under Overrides, matching the field by name):
- Match by name: pick the field produced by the second query (the anomaly score series).
- Override: `Color` -> set it to a fixed, bright, alarming red (like `#FF0000`).
- Override: the graph style -> set it to `Points`.
Go back to the Panel tab and, in the Graph styles section, set the “Point size” to something visible, like `5`.
What you’ve created is a time series graph of your memory. Whenever the ML model detects an anomaly, it will place a prominent red dot directly on the graph at that point in time. It’s incredibly intuitive—you can literally see the weirdness.
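For reference, here are the two panel queries side by side, exactly as used above (the `job` label must match your ML job’s name verbatim):

```promql
# Query A: the raw metric, styled as the blue line
node_memory_MemAvailable_bytes{instance="$instance"}

# Query B: the anomaly score for our ML job, filtered so data points only
# come back when the model flags something
__ml_analysis_anomaly_score{job="Memory Available Anomaly Detection"} > 0
```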
Pro Tip: The Power of Transformations
You can get even fancier. Raise the threshold in the overlay query itself (the score is a value between 0 and 1) so points only appear when the score exceeds, say, 0.7. This filters out minor "maybe" anomalies and highlights only the serious ones.
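In practice that just means tightening the overlay query from the panel above, for example:

```promql
# Only plot points for scores above 0.7, hiding the "maybe" anomalies
__ml_analysis_anomaly_score{job="Memory Available Anomaly Detection"} > 0.7
```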
Step 4: Alerting on Anomalies
Visualization is cool, but we need to be notified. This is where the real power is unlocked. We can create an alert that fires not on a static threshold, but on the model’s confidence that something is wrong.
- Create a new alert rule in Grafana.
- For the query, we want to use the raw anomaly score: `__ml_analysis_anomaly_score{job="Memory Available Anomaly Detection"}`
- Set the condition to `IS ABOVE 0.8`. The alert fires whenever the anomaly score reaches 0.8 or more on its 0 to 1 scale, i.e., when the model considers the data point highly anomalous.
- Configure your notification channels (Email, Slack, PagerDuty, etc.).
Now, you have an alert that doesn’t say “Memory is below 1GB,” but rather, “The system’s memory usage is behaving in a highly unusual way based on the last week of activity.” This is a far more powerful and nuanced signal.
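If you prefer to bake the threshold into the query itself rather than using a separate alert condition, the equivalent expression is a one-liner (same job name as above):

```promql
# Fire when the model scores the latest data as highly anomalous
__ml_analysis_anomaly_score{job="Memory Available Anomaly Detection"} > 0.8
```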
Troubleshooting and Common Pitfalls
I ran into a few issues while setting this up. Hopefully, this saves you some time.
- “No data” or “No results” for the `__ml_analysis_anomaly_score` metric?
  - Wait longer: The model needs to complete its initial training on the historical data. This can take a few minutes to an hour depending on the data volume.
  - Check the job name: The metric uses the `job` label you defined in the ML job configuration. My query uses `{job="Memory Available Anomaly Detection"}`. Make sure your panel query matches the exact name of your ML job (see the query sketch right after this list).
- The model is too sensitive / not sensitive enough?
  - Adjust the `tolerance`: Go back to your ML job configuration and edit the hyperparameters. Lower the tolerance for more alerts, raise it for fewer.
  - Lengthen the training window: If your data has strong monthly seasonality, a `7d` window might be too short. Try `30d`.
- The alerts are too noisy?
  - Tune your alert condition: Don’t alert on every anomaly (`> 0`). Start with a higher value like `> 0.7` or `> 0.8`.
  - Use alert grouping: Configure your alert manager to group anomalies from the same job into a single notification if they happen within a short time frame.
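Here’s that “no data” sanity check as a query sketch: run the score metric in Explore with no filters, inspect the `job` labels that come back, and copy the exact value into your panel query:

```promql
# Lists every anomaly-score series currently being produced, one per ML job
__ml_analysis_anomaly_score
```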
Where to Go From Here
You’ve just scratched the surface. The Grafana ML plugin can do more, like:
- Forecasting: Predict where a metric is heading based on trends and seasonality. Great for capacity planning.
- Multi-Metric Jobs: Train a model on the relationship between several metrics (e.g., CPU, Memory, and IO) for even more complex anomaly detection.
- Custom Labels: Apply the same ML job to multiple time series (e.g., every `instance` of a metric) using grouping.
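As a rough sketch of what that means on the query side (assuming a Prometheus scrape job named `node`, which is an assumption about your setup), a grouped job would be fed a query that returns one series per host rather than a single series:

```promql
# One series per instance label; a grouped ML job can learn a separate
# baseline for each host
node_memory_MemAvailable_bytes{job="node"}
```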
The goal is to move from reactive firefighting to proactive observation. Your dashboards become not just a display of the present, but an intelligent system that highlights the unexpected, giving you back the time to focus on what actually matters.
FAQ
Q: Is the Grafana Machine Learning plugin free? A: The underlying features are part of Grafana Enterprise, but they are available for free in the Grafana Cloud Free plan and for an unlimited time on a single series in self-hosted instances. For full use on self-hosted Grafana, you need an Enterprise license or the trial.
Q: Can I use this with data sources other than Prometheus?
A: Absolutely! The plugin works with many of the core time-series data sources, including Graphite and InfluxDB, as well as the Prometheus-compatible options used in this guide.
Q: How computationally expensive is this? Will it slow down my Grafana instance? A: For a moderate number of jobs and metrics, the impact is minimal. The heavy lifting of model training and scoring is handled by the Grafana backend. For very large-scale deployments (thousands of metrics), it’s recommended to monitor your Grafana backend resources.
Q: My data is very spiky. Will anomaly detection still work? A: It can be challenging. The model learns the “normal” pattern, including consistent spikes (like a daily cron job). However, if the spikiness is random and has no pattern, the model will have a hard time establishing a baseline. Choosing the right metric is key.