Prometheus has become a cornerstone of modern monitoring systems, providing powerful insights into the health and performance of applications and infrastructure.
However, effectively harnessing its capabilities requires more than just deploying the tool or instrumenting metrics in your application.
You also need to implement some best practices to ensure accurate data collection, efficient querying, and meaningful alerting.
In this article, we'll explore a few of the best practices to make your life easier when monitoring with Prometheus.
1. Follow metric and label naming conventions
While Prometheus does not enforce any strict rules for metric and label names, adhering to established conventions significantly enhances the usability, clarity, and maintainability of your metrics.
Consistent naming practices ensure that metrics are intuitive to work with and reduce confusion when querying or visualizing data. These conventions are outlined in the Prometheus documentation, with key recommendations including:
- Using lowercase characters for both metric names and labels, and using underscores to separate whole words (http_requests_total).
- Including base units in the metric name where applicable, such as _seconds, _bytes, or _total, to make the metric's purpose clear.
- Metric names should include a single-word prefix that reflects the domain they belong to, often the application name itself.
- Applying functions like sum() or avg() across all dimensions of a metric should produce results that are logical.
2. Don't use high cardinality labels
One common mistake when using Prometheus is overloading metrics with too many unique label combinations, which leads to an issue known as "cardinality explosion."
This occurs when an excessive number of time series is created due to high variation in label values, making it difficult for Prometheus to efficiently process or store the data.
In extreme cases, this can exhaust memory, causing the server to crash and leaving you without crucial monitoring data.
Suppose you are monitoring an e-commerce application and tracking order status with a metric like:
order_status_total{status="completed"}order_status_total{status="pending"}order_status_total{status="canceled"}
This is reasonable because the status label has a small, fixed set of values. However, if you decide to add a product_id label to monitor metrics for each individual product, the situation changes:
order_status_total{status="completed",product_id="1"}order_status_total{status="completed",product_id="2"}order_status_total{status="completed",product_id="3"}. . .order_status_total{status="completed",product_id="999999"}
In this scenario, every unique combination of product_id and status generates a new time series. With thousands or millions of products, the total number of time series quickly balloons, overwhelming Prometheus's storage and computational limits.
This can result in an out-of-memory (OOM) crash, leaving your monitoring system non-functional.
To avoid such problems, use labels only when necessary and keep their values within a manageable range. For example, you can replace values like /product/1234/details/5678 with a general pattern such as /product/{product_id}/details/{detail_id} before using it in a metric label.
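As a rough illustration of that normalization step, the Go sketch below collapses the IDs before recording the label. The route pattern, regular expression, and metric name are assumptions made for the example, not something prescribed by Prometheus:

package main

import (
    "regexp"

    "github.com/prometheus/client_golang/prometheus"
)

// requestsByRoute records traffic per normalized route rather than per raw URL.
var requestsByRoute = prometheus.NewCounterVec(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests by normalized route.",
}, []string{"route"})

// productDetailsPath matches concrete URLs like /product/1234/details/5678.
var productDetailsPath = regexp.MustCompile(`^/product/\d+/details/\d+$`)

// normalizeRoute maps high-cardinality URLs onto a single bounded label value.
func normalizeRoute(path string) string {
    if productDetailsPath.MatchString(path) {
        return "/product/{product_id}/details/{detail_id}"
    }
    return path
}

func main() {
    prometheus.MustRegister(requestsByRoute)

    // Both requests below increment the same time series.
    requestsByRoute.WithLabelValues(normalizeRoute("/product/1234/details/5678")).Inc()
    requestsByRoute.WithLabelValues(normalizeRoute("/product/4321/details/8765")).Inc()
}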
3. Track totals and failures instead of successes and failures
When instrumenting applications, it's common to track successes and failures as separate metrics, like:
api_failures_total
api_successes_total
While this seems logical, it complicates the calculation of derived metrics such as error rates. For example, calculating the error rate requires an expression like:
rate(api_failures_total[5m]) / (rate(api_successes_total[5m]) + rate(api_failures_total[5m]))
This query combines both counters to determine the total number of requests, which adds unnecessary complexity and increases the likelihood of mistakes in query construction.
A better approach is to track the total number of requests and the number of failures:
api_requests_total
api_failures_total
With this setup, calculating the error rate becomes straightforward:
rate(api_failures_total[5m]) / rate(api_requests_total[5m])
This structure is not only simpler but also provides flexibility. Derived metrics like success rates can be easily computed from these two counters:
# success rate
1 - (rate(api_failures_total[5m]) / rate(api_requests_total[5m]))
By using api_requests_total to track the total number of operations, you avoid duplication and reduce the cognitive load required to query your data.
This approach also makes your metrics more extensible, as additional labels or dimensions (e.g., status="200", status="500") can be added to api_requests_total without changing the underlying logic.
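Here is a sketch of how this might look when instrumenting a Go HTTP handler. The handler, endpoint, and help strings are hypothetical; only the two counter names follow the pattern above:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Every API call increments the total counter...
    apiRequestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "api_requests_total",
        Help: "Total number of API requests.",
    })
    // ...and only failed calls also increment the failure counter.
    apiFailuresTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "api_failures_total",
        Help: "Total number of failed API requests.",
    })
)

func handler(w http.ResponseWriter, r *http.Request) {
    apiRequestsTotal.Inc()
    if err := doWork(r); err != nil { // doWork is a placeholder for your handler logic
        apiFailuresTotal.Inc()
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK)
}

func doWork(r *http.Request) error { return nil }

func main() {
    prometheus.MustRegister(apiRequestsTotal, apiFailuresTotal)
    http.HandleFunc("/api", handler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}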
4. Always scope your PromQL queries
In Prometheus setups, especially those monitoring multiple microservices, it's crucial to scope your PromQL queries to avoid unintended metric collisions.
For instance, imagine your primary application database (db-service) tracks queries using a metric called db_queries_total.
Later, another service, such as cache-service, is introduced and also uses the metric name db_queries_total, but this time to track queries to a caching layer.
If your PromQL queries are not scoped, dashboards and alerts designed for the database might inadvertently include metrics from the caching layer. This leads to misleading graphs, false alerts, and confusion, as identical metric names now represent entirely different concepts.
This type of issue, known as a metric collision, arises when identical metric names across different services result in data being conflated or misinterpreted.
To prevent this, always use label matchers to scope your PromQL queries. Instead of using an unscoped query like:
rate(db_queries_total[5m]) > 10
Scope your query to the relevant service:
rate(db_queries_total{service="my_database_service"}[5m]) > 10
This ensures the query pulls data only from the intended source. Using labels such as service, job, or other identifiers specific to your setup not only reduces the risk of conflicts but also improves query accuracy and maintainability.
5. Add time tolerance to your alerts
Prometheus alerting rules support a for clause that defines how long a condition must persist before an alert is triggered.
While it might seem convenient to skip this delay, doing so can result in overly sensitive alerts that react to transient issues, causing unnecessary noise and potentially leading to alert fatigue.
Responders might become desensitized to alerts, making them less responsive to genuine problems.
For instance, even if you use expressions like rate(errors_total[5m]) in your alerting rules, a newly started Prometheus instance may not yet have enough data to calculate accurate averages, leading alerts to fire based on incomplete or misleading information.
For example, consider this rule that triggers on high API latency:
alert: HighAPILatency
expr: histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5
Without a for clause, even a brief spike in latency could trigger this alert, creating noise and causing unnecessary disruption. Instead, you can refine the rule by adding a time tolerance:
alert: HighAPILatency
expr: histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5
for: 10m
This modification ensures that the alert only fires if the high latency persists for at least 10 minutes, reflecting sustained performance degradation rather than a momentary blip.
6. Handle missing metrics for consistent monitoring
Prometheus excels at tracking metrics over time, but it can stumble when metrics with labels appear and disappear unexpectedly. This can lead to empty query results, broken dashboards, and misfiring alerts.
For instance, if you're tracking specific error events through an errors_total metric, you may have a type label to allow filtering by error type, such as:
errors_total{type="rate_limit_exceeded"}errors_total{type="timeout"}errors_total{type="internal_server_error"}
If you query a specific error type, such as:
sum(rate(errors_total{type="host_unreachable"}[5m]))
This query will only return results if that specific error type has occurred in the last five minutes. If no such error has occurred, the query will return an empty result.
This "missing metric" problem can disrupt your monitoring in several ways:
- Dashboards might show empty graphs or "No data" messages.
- Alerts based on these metrics might not fire, even if the issue exists but hasn't occurred recently enough to register in the time window.
To prevent missing metrics, initialize all possible labeled metrics to zero at application startup when the set of label values is known in advance. For example, in Go:
for _, val := range errorLabelValues {
    // Calling WithLabelValues creates the labeled series at zero;
    // don't call Inc() here, or you'd record an error that never happened.
    errorsCounter.WithLabelValues(val)
}
This ensures that Prometheus always has a baseline metric to query, even if no events have occurred yet.
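For context, here is a minimal, self-contained version of that initialization. The help string and the specific error types are assumptions for the sketch; the variable names mirror the snippet above:

package main

import "github.com/prometheus/client_golang/prometheus"

// The set of error types is known ahead of time.
var errorLabelValues = []string{"rate_limit_exceeded", "timeout", "internal_server_error"}

var errorsCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
    Name: "errors_total",
    Help: "Total number of errors by type.",
}, []string{"type"})

func main() {
    prometheus.MustRegister(errorsCounter)

    // Pre-create each labeled series at zero so queries and alerts never
    // encounter a missing metric before the first error of that type occurs.
    for _, val := range errorLabelValues {
        errorsCounter.WithLabelValues(val)
    }
}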
In situations where metrics have dynamically generated labels, it may not be feasible to initialize them at startup. In such cases, adjust your PromQL queries to account for missing metrics using the or operator.
For example, if you're calculating the ratio of specific errors, you could write:
(rate(errors_total{type="timeout"}[10m]) or up * 0) / (rate(errors_total[10m]) or up * 0)
This approach replaces missing metrics with a default value of zero, ensuring that your query remains functional and provides accurate results even when specific error types are absent.
7. Preserve important labels in alerting rules
While simplifying Prometheus alerting rules by aggregating away labels might seem convenient, it can strip away essential context that is crucial for diagnosing and resolving issues.
Take, for example, a rule that triggers an alert for high CPU usage across a cluster of servers:
alert: HighCPUUsage
expr: avg by (job) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
This rule averages idle CPU time across all instances in each job and alerts when it falls below 10%, meaning usage exceeds 90%. However, by aggregating data to the job level, it obscures which specific instance is causing the high CPU usage.
This lack of detail forces you to investigate dashboards or logs to identify the problematic instance, adding delays to your response time.
To address this, avoid aggregating away critical labels and include them in your alerting rules and notifications. For instance:
alert: HighCPUUsage
expr: avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) < 0.1
labels:
  severity: warning
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "Instance {{ $labels.instance }} has high CPU usage (idle: {{ $value }})"
This updated rule keeps the instance label to ensure the alert provides immediate context about which server is experiencing high CPU usage.
Including this label in the alert message also means responders can quickly identify and address the issue without additional investigation.
8. Have a plan for scaling
Prometheus is a powerful tool for monitoring, but as your infrastructure and application complexity grow, you'll need to address the challenges of scale.
Increased services, larger data volumes, and longer retention periods can push Prometheus to its limits. Anticipating and planning for these challenges ensures that your monitoring remains effective and reliable as your environment evolves.
Prometheus, by design, is not horizontally scalable. It is limited to a single-node architecture, meaning you can only increase capacity through vertical scaling (e.g., adding more CPU, memory, or storage to the server). However, vertical scaling has its limits. Once you approach those limits, alternative strategies are necessary.
A common approach is a federated Prometheus setup, where a "global" Prometheus server aggregates data from regional instances.
If you have several Prometheus servers, each scraping metrics from a subset of your services, you would then set up a single Prometheus server that scrapes data from each of the shards and aggregates it in one place.
Alternatively, open-source projects like Thanos and Cortex allow you to implement scalable, long-term storage and query aggregation for Prometheus metrics. These solutions go beyond basic federation by enabling global querying, deduplication of metrics, and cross-cluster metric aggregation while offering support for high-availability setups.
If you don't want to do all the work of scaling Prometheus yourself, consider using a fully-managed Prometheus service like Better Stack, which provides a hands-off solution for long-term metric storage and querying.
Final thoughts
Starting with these Prometheus best practices is a strong foundation for building a reliable and scalable monitoring setup. However, regular reviews and ongoing improvements are essential to ensure your monitoring adapts to the growing complexity of your infrastructure and evolving business requirements.
Thanks for reading, and happy monitoring!
Ayo is a technical content manager at Better Stack. His passion is simplifying and communicating complex technical ideas effectively. His work has been featured in several esteemed publications, including LWN.net, Digital Ocean, and CSS-Tricks. When he's not writing or coding, he loves to travel, bike, and play tennis.