Prometheus Best Practices: 8 Dos and Don'ts

Prometheus has become a cornerstone of modern monitoring systems, providing powerful insights into the health and performance of applications and infrastructure.

However, effectively harnessing its capabilities requires more than just deploying the tool or instrumenting metrics in your application.

You also need to implement some best practices to ensure accurate data collection, efficient querying, and meaningful alerting.

In this article, we'll explore a few of the best practices to make your life easier when monitoring with Prometheus.

1. Follow metric and label naming conventions

While Prometheus does not enforce any strict rules for metric and label names, adhering to established conventions significantly enhances the usability, clarity, and maintainability of your metrics.

Consistent naming practices ensure that metrics are intuitive to work with and reduce confusion when querying or visualizing data. These conventions are outlined in the Prometheus documentation, with key recommendations including the following (a short instrumentation sketch follows the list):

  • Using lowercase characters for both metric names and labels, with underscores separating whole words (http_requests_total).

  • Including base units such as _seconds or _bytes in the metric name where applicable, and the _total suffix for counters, to make the metric's purpose clear.

  • Prefixing metric names with a single word that reflects the domain they belong to, often the application name itself.

  • Ensuring that applying functions like sum() or avg() across all dimensions of a metric produces a logical result.
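
Taken together, these conventions look something like the following sketch using the official Go client; the myapp prefix, metric names, and help strings are illustrative assumptions rather than anything prescribed here:

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// myapp_http_request_duration_seconds: lowercase, underscore-separated,
// prefixed with the application name, and suffixed with the base unit.
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "myapp_http_request_duration_seconds",
    Help: "Duration of HTTP requests in seconds.",
})

// myapp_http_requests_total: a counter with the conventional _total suffix,
// labeled only by low-cardinality dimensions.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "myapp_http_requests_total",
    Help: "Total number of HTTP requests handled.",
}, []string{"method", "status"})

func main() {
    // Expose the metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Because both metrics share an application prefix and carry explicit units, expressions like sum(rate(myapp_http_requests_total[5m])) stay readable and aggregate cleanly.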

2. Don't use high cardinality labels

One common mistake when using Prometheus is overloading metrics with too many unique label combinations, which leads to an issue known as "cardinality explosion."

This occurs when an excessive number of time series is created due to high variation in label values, making it difficult for Prometheus to efficiently process or store the data.

In extreme cases, this can exhaust memory, causing the server to crash and leaving you without crucial monitoring data.

Suppose you are monitoring an e-commerce application and tracking order status with a metric like:

order_status_total{status="completed"}
order_status_total{status="pending"}
order_status_total{status="canceled"}

This is reasonable because the status label has a small, fixed set of values. However, if you decide to add a product_id label to monitor metrics for each individual product, the situation changes:

order_status_total{status="completed",product_id="1"}
order_status_total{status="completed",product_id="2"}
order_status_total{status="completed",product_id="3"}
...
order_status_total{status="completed",product_id="999999"}

In this scenario, every unique combination of product_id and status generates a new time series. With thousands or millions of products, the total number of time series quickly multiplies beyond Prometheus's storage and computational limits.

This can result in an out-of-memory (OOM) crash, leaving your monitoring system non-functional.

To avoid such problems, use labels only when necessary and keep their values within a manageable range. For example, you can replace values like /product/1234/details/5678 with a general pattern such as /product/{product_id}/details/{detail_id} before using it in a metric label.
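
As a rough sketch of that normalization step in Go, you might map raw paths to a fixed set of route patterns before they ever reach a label; the normalizePath helper, metric name, and route label below are hypothetical:

package metrics

import (
    "regexp"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// requestsTotal is labeled by a normalized route pattern, never the raw URL.
var requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "myapp_http_requests_total",
    Help: "Total HTTP requests, labeled by normalized route.",
}, []string{"route"})

// productPath matches raw paths like /product/1234/details/5678.
var productPath = regexp.MustCompile(`^/product/\d+/details/\d+$`)

// normalizePath collapses unbounded path values into a small, fixed set of
// patterns so the route label stays low-cardinality.
func normalizePath(path string) string {
    if productPath.MatchString(path) {
        return "/product/{product_id}/details/{detail_id}"
    }
    return "other"
}

// RecordRequest increments the counter using the normalized route.
func RecordRequest(path string) {
    requestsTotal.WithLabelValues(normalizePath(path)).Inc()
}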

3. Track totals and failures instead of successes and failures

When instrumenting applications, it's common to track successes and failures as separate metrics, like:

api_failures_total
api_successes_total

While this seems logical, it complicates the calculation of derived metrics such as error rates. For example, calculating the error rate requires an expression like:


rate(api_failures_total[5m]) / (rate(api_successes_total[5m]) + rate(api_failures_total[5m]))

This query combines both counters to determine the total number of requests, which adds unnecessary complexity and increases the likelihood of mistakes in query construction.

A better approach is to track the total number of requests and the number of failures:

api_requests_total
api_failures_total

With this setup, calculating the error rate becomes straightforward:


rate(api_failures_total[5m]) / rate(api_requests_total[5m])

This structure is not only simpler but also provides flexibility. Derived metrics like success rates can be easily computed from these two counters:

# success rate
1 - (rate(api_failures_total[5m]) / rate(api_requests_total[5m]))

By using api_requests_total to track the total number of operations, you avoid duplication and reduce the cognitive load required to query your data.

This approach also makes your metrics more extensible, as additional labels or dimensions (e.g., status="200", status="500") can be added to api_requests_total without changing the underlying logic.
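
As a minimal instrumentation sketch of this pattern with the Go client (the HandleRequest and doWork functions are hypothetical stand-ins for real request handling):

package api

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // Every request increments the total, whether it succeeds or fails.
    apiRequestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "api_requests_total",
        Help: "Total API requests, successful or not.",
    })
    // Only failed requests increment the failure counter.
    apiFailuresTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "api_failures_total",
        Help: "API requests that ended in an error.",
    })
)

// HandleRequest demonstrates the increment pattern.
func HandleRequest(w http.ResponseWriter, r *http.Request) {
    apiRequestsTotal.Inc()

    if err := doWork(r); err != nil {
        apiFailuresTotal.Inc()
        http.Error(w, "internal error", http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// doWork is a placeholder for the application's actual request handling.
func doWork(r *http.Request) error { return nil }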

4. Always scope your PromQL queries

In Prometheus setups, especially those monitoring multiple microservices, it's crucial to scope your PromQL queries to avoid unintended metric collisions.

For instance, imagine your primary application database (db-service) tracks queries using a metric called db_queries_total.

Later, another service, such as cache-service, is introduced and also uses the metric name db_queries_total, but this time to track queries to a caching layer.

If your PromQL queries are not scoped, dashboards and alerts designed for the database might inadvertently include metrics from the caching layer. This leads to misleading graphs, false alerts, and confusion, as identical metric names now represent entirely different concepts.

This type of issue, known as a metric collision, arises when identical metric names across different services result in data being conflated or misinterpreted.

To prevent this, always use label matchers to scope your PromQL queries. Instead of using an unscoped query like:


rate(db_queries_total[5m]) > 10

Scope your query to the relevant service:


rate(db_queries_total{service="my_database_service"}[5m]) > 10

This ensures the query pulls data only from the intended source. Using labels such as service, job, or other identifiers specific to your setup not only reduces the risk of conflicts but also improves query accuracy and maintainability.


5. Add time tolerance to your alerts

Prometheus alerting rules support a for clause that defines how long a condition must persist before an alert is triggered.

While it might seem convenient to skip this delay, doing so can result in overly sensitive alerts that react to transient issues, causing unnecessary noise and potentially leading to alert fatigue.

Responders might become desensitized to alerts, making them less responsive to genuine problems.

Additionally, even if you use expressions like rate(errors_total[5m]) in your alerting rules, a newly started Prometheus instance may not yet have enough data to calculate accurate averages, causing alerts to fire based on incomplete or misleading information.

For example, consider this rule that triggers on high API latency:

alert: HighAPILatency
expr: histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5

Without a for clause, even a brief spike in latency could trigger this alert, creating noise and causing unnecessary disruption. Instead, you can refine the rule by adding a time tolerance:

alert: HighAPILatency
expr: histogram_quantile(0.95, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5
for: 10m

This modification ensures that the alert only fires if the high latency persists for at least 10 minutes, reflecting sustained performance degradation rather than a momentary blip.

6. Handle missing metrics for consistent monitoring

Prometheus excels at tracking metrics over time, but it can stumble when metrics with labels appear and disappear unexpectedly. This can lead to empty query results, broken dashboards, and misfiring alerts.

For instance, if you're tracking specific error events through an errors_total metric, you may have a type label to allow filtering by error type, such as:

errors_total{type="rate_limit_exceeded"}
errors_total{type="timeout"}
errors_total{type="internal_server_error"}

If you query a specific error type, such as:


sum(rate(errors_total{type="host_unreachable"}[5m]))

This query will only return results if that specific error type has occurred in the last five minutes. If no such error has occurred, the query will return an empty result.

This "missing metric" problem can disrupt your monitoring in several ways:

  • Dashboards might show empty graphs or "No data" messages.
  • Alerts based on these metrics might not fire, even if the issue exists but hasn't occurred recently enough to register in the time window.

To prevent missing metrics, initialize all possible labeled metrics to zero at application startup when the set of label values is known in advance. For example, in Go:

for _, val := range errorLabelValues {
    errorsCounter.WithLabelValues(val) // creates the series at zero; don't use Inc()
}

This ensures that Prometheus always has a baseline metric to query, even if no events have occurred yet.
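
For context, the snippet above assumes a counter vector and a list of known label values roughly like the declarations below; these are a sketch rather than code from the article:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// errorsCounter has a single type label whose possible values are known ahead
// of time, so every series can be created at zero on startup.
var errorsCounter = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "errors_total",
    Help: "Total errors observed, labeled by type.",
}, []string{"type"})

var errorLabelValues = []string{"rate_limit_exceeded", "timeout", "internal_server_error"}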

In situations where metrics have dynamically generated labels, it may not be feasible to initialize them at startup. In such cases, adjust your PromQL queries to account for missing metrics using the or operator.

For example, if you're calculating the ratio of specific errors, you could write:


(rate(errors_total{type="timeout"}[10m]) or up * 0) / (rate(errors_total[10m]) or up * 0)

This approach replaces missing metrics with a default value of zero, ensuring that your query remains functional and provides accurate results even when specific error types are absent.

7. Preserve important labels in alerting rules

While simplifying Prometheus alerting rules by aggregating away labels might seem convenient, it can strip away essential context that is crucial for diagnosing and resolving issues.

Take, for example, a rule that triggers an alert for high CPU usage across a cluster of servers:

alert: HighCPUUsage
expr: avg(node_cpu_seconds_total{mode="idle"}) by (job) < 0.1

This rule calculates average idle CPU for each job and alerts when it drops below 0.1 (roughly, CPU usage above 90%). However, by aggregating data to the job level, it obscures which specific instance is causing the high CPU usage.

This lack of detail forces you to investigate dashboards or logs to identify the problematic instance, adding delays to your response time.

To address this, avoid aggregating away critical labels and include them in your alerting rules and notifications. For instance:

alert: HighCPUUsage
expr: node_cpu_seconds_total{mode="idle"} < 0.1
labels:
  severity: warning
annotations:
  summary: "High CPU usage on {{ $labels.instance }}"
  description: "Instance {{ $labels.instance }} has high CPU usage (idle: {{ $value }})"

This updated rule keeps the instance label to ensure the alert provides immediate context about which server is experiencing high CPU usage.

Including this label in the alert message also means responders can quickly identify and address the issue without additional investigation.

8. Have a plan for scaling

Prometheus is a powerful tool for monitoring, but as your infrastructure and application complexity grow, you'll need to address the challenges of scale.

Increased services, larger data volumes, and longer retention periods can push Prometheus to its limits. Anticipating and planning for these challenges ensures that your monitoring remains effective and reliable as your environment evolves.

Prometheus, by design, is not horizontally scalable. It is limited to a single-node architecture, meaning you can only increase capacity through vertical scaling (e.g., adding more CPU, memory, or storage to the server). However, vertical scaling has its limits. Once you approach those limits, alternative strategies are necessary.

A common approach is federated Prometheus setups, where a "global" Prometheus server aggregates data from regional instances.

If you have several Prometheus servers, with each one scraping metrics from a subset of your services, you would then set up a single Prometheus server that scrapes data from each of the shards and aggregates it in one place.
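
As a rough sketch, the global server's scrape configuration might pull pre-aggregated series from each shard through Prometheus's /federate endpoint; the shard hostnames and match[] selectors below are placeholders:

scrape_configs:
  - job_name: "federate"
    scrape_interval: 30s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # e.g. series produced by recording rules
    static_configs:
      - targets:
          - "prometheus-shard-1:9090"
          - "prometheus-shard-2:9090"

Federating only aggregated series, such as those produced by recording rules, keeps the load on the global server manageable.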

Alternatively, open-source projects like Thanos and Cortex allow you to implement scalable, long-term storage and query aggregation for Prometheus metrics. These solutions go beyond basic federation by enabling global querying, deduplication of metrics, and cross-cluster metric aggregation while offering support for high-availability setups.

If you don't want to do all the work of scaling Prometheus yourself, consider using a fully-managed Prometheus service like Better Stack, which provides a hands-off solution for long-term metric storage and querying.

Final thoughts

Starting with these Prometheus best practices is a strong foundation for building a reliable and scalable monitoring setup. However, regular reviews and ongoing improvements are essential to ensure your monitoring adapts to the growing complexity of your infrastructure and evolving business requirements.

Thanks for reading, and happy monitoring!


Article by

Ayooluwa Isaiah

Ayo is a technical content manager at Better Stack. His passion is simplifying and communicating complex technical ideas effectively. His work has been featured in several esteemed publications, including LWN.net, Digital Ocean, and CSS-Tricks. When he's not writing or coding, he loves to travel, bike, and play tennis.

Got an article suggestion? Let us know


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
