From 4ce5b60c40a349b86e3872c661698d5d36778f73 Mon Sep 17 00:00:00 2001 From: beorn7 Date: Wed, 18 Feb 2026 18:40:04 +0100 Subject: [PATCH 1/2] Revamp histogram and summaries best practices With this update, the best practices document about histogram and summaries finally takes native histograms into account. Signed-off-by: beorn7 --- docs/practices/histograms.md | 656 +++++++++++++++++++++++++---------- 1 file changed, 479 insertions(+), 177 deletions(-) diff --git a/docs/practices/histograms.md b/docs/practices/histograms.md index fb8c242fa..0974766c2 100644 --- a/docs/practices/histograms.md +++ b/docs/practices/histograms.md @@ -3,136 +3,388 @@ title: Histograms and summaries sort_rank: 4 --- -NOTE: This document predates native histograms (added as an experimental -feature in Prometheus v2.40 and becoming stable in v3.8). The intention is to -thoroughly update this document in the foreseeable future. - -Histograms and summaries are more complex metric types. Not only does -a single histogram or summary create a multitude of time series, it is -also more difficult to use these metric types correctly. This section -helps you to pick and configure the appropriate metric type for your -use case. - -## Library support +Histograms and summaries are more complex metric types. For historical reasons, +histograms exist in two variants: classic histograms and native histograms, the +latter even come in a number of sub-variants. This document helps to understand +the difference between all those metric types, how to use them correctly, and +how to pick the right metric type for your use case. + +The most important lesson to learn from this document is simple: If you can, +use native histograms and prefer them over both classic histograms and +summaries. + +Where things start to become tricky is if you find yourself in a situation +where you cannot simply use native histograms. Most commenly, you might have to +work with existing metrics that include classic histograms or summaries, or +maybe the instrumentation library you are using does not support native +histograms yet. Furthermore, there are a few specific use cases where you might +prefer a summary or a classic histogram. + +With this document, you should be able to navigate the related obstacles and +subtleties. + +## Overview + +Historically, a sample in the Prometheus world was just a timestamped floating +point value. This value could be interpreted as a +[counter](/docs/concepts/metric_types/#counter) or as a +[gauge](/docs/concepts/metric_types/#gauge), i.e. most of the time Prometheus +doesn't maintain a notion of “static typing”, and you just have to know what +kind of metric you are dealing with (assisted by the convention that the name +of a counter should end on `_total`). + +But there are more metric types than counters and gauges. In particular, there +is a need to represent distributions of observed values (usually simply called +“observations” in Prometheus terminology). There are fundamentally two +different approaches: + +1. The instrumented program calculates a number of pre-configured quantiles + (e.g. the median or the 90th percentile) over pre-configured time windows + (e.g. the last ten minutes) and exposes them as additional metrics. + Prometheus implements this approach in the form of a metric type called + _summary_. Depending on the used algorithm, the pre-calculated quantiles are + usually very accurate. But the calculation has a resource cost for the + instrumented program. Also, you cannot “recalculate” the quantiles later if + you desire another time window or another percentile, and most importantly, + you cannot aggregate quantiles (e.g. to calculate the total 90th percentile + latency for a service backed by multiple replicated workers). +2. The instrumented program represents the distribution in a more fundamental + way that can later be used to calculate arbitrary quantiles over arbitrary + time windows. Most importantly, the distribution is represented in a way + that can be aggregated with each other. This kind of representation is + sometimes called a _digest_. Prometheus implements this approach in the form + of a metric type called _histogram_, where observations are counted in + buckets, as you might know it from the general concept of a + [histogram](https://en.wikipedia.org/wiki/Histogram). + +In both approaches, Prometheus also collects the count and the sum of +observations (see details [below](#count-and-sum-of-observations)). + +Common to both approaches is the need to collect a whole lot of numerical +values per sample, not just a single floating point value as before: + +- In any case the count and sum of observations. +- In the case of summaries the pre-calculated quantiles. +- In the case of histograms a set of buckets with their population counts and + boundaries. + +The new types of metrics are also called _composite types_. + +In a first approach, Prometheus preserved its data model of simple timestamped +floating point values and mapped this multitude of values into one time series +each, distinguished by specific labels. In this way, summaries and classic +histograms were created. In both, the count and sum of observations are each +tracked in a separate time series. Similarly, each pre-calculated quantile of a +summary and each bucket of a histogram is tracked in its own time series. +PromQL operators and functions act on these individual time series, as +explained in detail further below. + +On the one hand, this approach has worked quite well. While keeping the data +model simple, it satisfies many use cases. On the other hand, it suffers from +many limitations, especially when it comes to histograms. Thus, much later in +Prometheus's lifetime, native histograms were introduced. A native histogram +sample is a “composition of values”, where a single sample contains the count +and sum of observations and a dynamic number of buckets with their population +count and boundaries. In the Prometheus TSDB, one histogram results in one time +series of native histogram samples rather than a bunch of independent time +series. PromQL operators and functions now have to act on these composite +samples rather than on the individual time series of floats before. + +You can read everything about native histograms in their +[specification](/docs/specs/native_histograms/), but be warned that this is a +very technical and detailed document. If you read on here, you can expect a +more digestible and usage-focused explanation. + +If you are interested in Prometheus's journey towards native representation of +composite types, you can read more in a [blog +post](/blog/2026/02/14/modernizing-prometheus-composite-samples/). + +## Instrumentation library support First of all, check the library support for [histograms](/docs/concepts/metric_types/#histogram) and [summaries](/docs/concepts/metric_types/#summary). -Some libraries support only one of the two types, or they support summaries -only in a limited fashion (lacking [quantile calculation](#quantiles)). +Summaries are usually supported by all libraries, but some might only track the +count and sum of observations and omit the [quantile calculation](#quantiles). +(Quantile-less summaries is still a legitimate use of summaries, see below.) + +Classic histogram support is also widespread, but native histogram support is +still rare. Currently, the latter requires exposition via the protobuf format, +limiting the support to protobuf-enabled libraries, like the Java and the Go +library. Support in a text-based format is underway as part of OpenMetrics v2. +Things should be moving very soon, so definitely check what your library has to +offer. + +Even if your instrumented program only exposes classic histogram, you can +configure Prometheus to ingest them as native histograms anyway. This will +happen in the form of _Native Histograms with Custom Bucket boundaries_ (NHCB). +These NHCBs have some limitations compared to the usual native histograms +(which feature so-called standard exponential buckets), but they are still much +more efficient to store than pure classic histograms. NHCBs handling in PromQL +is the same as for other native histograms, so a later migration to “real” +native histograms will be easy. + +## Ingestion via Open Telemetry + +Maybe you aren't even using a Prometheus instrumentation library, but your +metrics come from a collector adhering to the Open Telemetry (OTel) standard. +When ingesting OTel metrics into a Prometheus-compatible backends, the “normal” +OTel histograms can be converted into classic histograms or NHCBs on the +Prometheus side (hint: prefer the latter), while OTel's _exponential histograms_ +are always converted into the usual native histograms (with standard +exponential buckets). ## Count and sum of observations -Histograms and summaries both sample observations, typically request -durations or response sizes. They track the number of observations -*and* the sum of the observed values, allowing you to calculate the -*average* of the observed values. Note that the number of observations -(showing up in Prometheus as a time series with a `_count` suffix) is -inherently a counter (as described above, it only goes up). The sum of -observations (showing up as a time series with a `_sum` suffix) -behaves like a counter, too, as long as there are no negative -observations. Obviously, request durations or response sizes are -never negative. In principle, however, you can use summaries and -histograms to observe negative values (e.g. temperatures in -centigrade). In that case, the sum of observations can go down, so you -cannot apply `rate()` to it anymore. In those rare cases where you need to -apply `rate()` and cannot avoid negative observations, you can use two -separate summaries, one for positive and one for negative observations -(the latter with inverted sign), and combine the results later with suitable -PromQL expressions. - -To calculate the average request duration during the last 5 minutes -from a histogram or summary called `http_request_duration_seconds`, -use the following expression: +Histograms and summaries both sample observations, typically request durations +or response sizes. In all variants (even quantile-less summaries), they track +the number of observations *and* the sum of the observed values, allowing you +to calculate the *average* of the observed values. + +To do so, you generally first take a `rate` over the desired duration and then +divide the “rate of the sum” by the “rate of the count”. + +For a native histogram (including an NHCB), you extract the sum and count of +observations with the functions `histogram_sum` and `histogram_count`, +respectively. For example, to calculate the average request duration over the +last 5m from a native histogram called `http_request_duration_seconds`, use the +following PromQL expression: + + histogram_sum(rate(http_request_duration_seconds[5m])) + / + histogram_count(rate(http_request_duration_seconds[5m])) + +In the case of a summary or a classic histogram, you have separate time series +for the sum and count of observations, marked by the magic suffixes `_sum` and +`_count`, respectively. Thus, a summary or classic histogram called +`http_request_duration_seconds` will result in the series +`http_request_duration_seconds_sum` and `http_request_duration_seconds_count`, +and the expression to calculate the average request duration over the last 5m +will look like this: rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]) -## Apdex score - -A straight-forward use of histograms (but not summaries) is to count -observations falling into particular buckets of observation -values. +The denominator in both expressions above is also useful on its own. It +represents the requests per second served over the last 5m. Another way of +putting it is that the `http_request_duration_seconds_count` series behaves +exactly like a counter for the HTTP requests (which you would call +`http_requests_total` if you did not already have the histogram or summary to +replace it). The key property of a counter in Prometheus is that it always goes +up, unless there is a counter reset. + +If your observations are never negative, the +`http_request_duration_seconds_sum` series also always goes up (unless there is +a counter reset). However, if negative observations are in the mix, the sum of +observations may also go down, breaking assumptions made by PromQL. Such a drop +would erroneously be considered a counter reset in the +`rate(http_request_duration_seconds_sum[5m])` calculation above, throwing off +the result. Note that this problem only affects summaries and classic +histograms. Native histograms (including NHCBs) are `rate`'d as a whole, +thereby detecting counter resets correctly. In the rare cases where you cannot +avoid negative observations and are stuck with summaries or classic histograms, +you can use two separate summaries or histograms, one for positive and one for +negative observations (the latter with inverted sign), and combine the results +later with suitable PromQL expressions. + +Both the sum and count of observations are additive, so you can easily +aggregate – after the rate, but before the division, and no matter what the +underlying metric type is. These are the expressions to calculate the average +request duration for each `job`: + +Native histograms: + + sum by (job) (histogram_sum(rate(http_request_duration_seconds[5m]))) + / + sum by (job) (histogram_count(rate(http_request_duration_seconds[5m]))) -You might have an SLO to serve 95% of requests within 300ms. In that -case, configure a histogram to have a bucket with an upper limit of -0.3 seconds. You can then directly express the relative amount of -requests served within 300ms and easily alert if the value drops below -0.95. The following expression calculates it by job for the requests -served in the last 5 minutes. The request durations were collected with -a histogram called `http_request_duration_seconds`. +Summaries or classic histograms: - sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) + sum by (job) (rate(http_request_duration_seconds_sum[5m])) / - sum(rate(http_request_duration_seconds_count[5m])) by (job) + sum by (job) (rate(http_request_duration_seconds_count[5m])) + +## Bucketing + +Histograms are essentially bucketed counters, so the most obvious use case that +separates histograms from summaries is to count observations falling into +particular buckets of observation values. + +If you instrument code with classic histograms, you will configure fixed bucket +boundaries. If you let Prometheus ingest these classic histograms in the +classic way, each bucket configured in that way will create a series suffixed +with `_bucket`, no matter if the bucket is populated or not. More buckets give +you more options and accuracy in the various queries (see below), but the “one +series per bucket” cost is quite significant. + +If you ingest the classic histograms as NHCBs, unpopulated buckets have a +negligible cost, and even populated ones are handled in a more efficient way +because each NHCB is represented by a single series of composite samples +(rather than by a separate series of floats for each bucket and the sum and +count of observations). + +However, picking the right buckets in advance can be challenging. And changing +buckets later creates a lot of disruption (as you will see below). If you +instrument code directly with native histograms, you do not pick any bucket +boundaries explicitly, but you configure a desired resolution. Buckets are +created dynamically following an exponential bucketing schema, covering the +whole range of floating point numbers from -Inf to +Inf. Higher resolution +causes higher resource usage, but generally you can reach much higher resolution +than with classic histograms for the same resource cost. Instrumentation +libraries also offer various strategies to limit the count of populated +buckets, like occasional resets of the histogram or adaptive resolution +reduction. See the documentation of the instrumentation library you are using +for details. + +To query the fraction of observations falling into a certain range based on a +native histogram, use an expression like the following: + + histogram_fraction(0, 0.3, sum by (job) (rate(http_request_duration_seconds[5m]))) + +This calculates the fraction of HTTP requests for each `job` that lasted +between 0ms and 300ms in the last 5m. (300ms are represented here as `0.3` +seconds as you should always use base units in Prometheus.) Note how the `sum` +correctly aggregates by summing up the corresponding buckets in the involved +histograms. If the histograms have different bucket layouts, they are +reconciled first. With the usual exponential bucketing schema, this works +smoothly, essentially by falling back to the lowest common resolution among all +involved histograms. The same is done to reconcile different bucket layouts +over time (in the 5m range that is used in the `rate` calculation). With NHCBs, +the effects of this depend heavily on the details of the different bucket +layouts. It is well possible that the reconciled aggregated histogram has just +one bucket left, containing all observations. Because of the potentially severe +effects, the query result gets an info-level annotation if NHCBs needed to be +reconciled. This is also one of the reasons why native histograms with the +dynamic exponential buckets are much easier to handle. + +The calculated fraction is accurate if there happens to be a bucket boundary +precisely at 0.3. In the common case that there is not, interpolation is used +to return an estimated fraction. This estimation is more accurate with higher +bucket resolutions. If you already know in advance that, for example, you have +an SLO to serve 95% of requests within 300ms, you could use the fixed bucket +boundaries of a classic histogram to allow an accurate calculation. However, if +your SLO changes later, changing the fixed bucket layout accordingy will be +quite tedious. (You have to change the instrumentation of your code. And you +will run into the issues reconciling different bucket layouts as described +above.) If you pick native histograms with the dynamic exponential buckets, you +won't get a bucket boundary at exactly 0.3, but with a decent resolution, the +interpolated estimate will still be quite accurate. In return, you gain the +freedom of changing the range boundaries at will, which is not only helpful if +your SLO changes, but also to explore questions like “Could we maintain a +stricter SLO based on the data of the last quarter?”. + +In the pure legacy case of classic histograms that were also ingested as +classic histograms, the corresponding PromQL expression looks quite different: + + sum by (job) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) + / + sum by (job) (rate(http_request_duration_seconds_count[5m])) +The `le` label name stands for “less or equal”. This label's value is the upper +inclusive boundary of a cumulative bucket (i.e. this bucket contains all +observations less than or equal to 0.3 – including negative observations, which +we assume wouldn't happen in the case of observing request durations). -You can approximate the well-known [Apdex -score](http://en.wikipedia.org/wiki/Apdex) in a similar way. Configure -a bucket with the target request duration as the upper bound and -another bucket with the tolerated request duration (usually 4 times -the target request duration) as the upper bound. Example: The target -request duration is 300ms. The tolerable request duration is 1.2s. The -following expression yields the Apdex score for each job over the last -5 minutes: +Note that this expression strictly requires a bucket boundary configured at +0.3. If the histograms involved do not have a bucket with that boundary, no +interpolation is applied. Instead of an estimation, no result is returned at +all. If only some of the involved histograms have such a bucket, an incomplete +result is returned, but without any warning, which is a pretty bad situation to +be in. (Hint: Avoid this “purely classic” case. If you can, ingest classic +histograms as NHCB. Or instrument with native histograms in the first place.) - ( - sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job) - + - sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) by (job) - ) / 2 / sum(rate(http_request_duration_seconds_count[5m])) by (job) +## Apdex score -Note that we divide the sum of both buckets. The reason is that the histogram -buckets are -[cumulative](https://en.wikipedia.org/wiki/Histogram#Cumulative_histogram). The -`le="0.3"` bucket is also contained in the `le="1.2"` bucket; dividing it by 2 -corrects for that. +When reading about fractions of requests served within a certain duration +range, you might remember the [Apdex +score](http://en.wikipedia.org/wiki/Apdex). For this score, you set a target +request duration and a tolerated request duration (usually 4 times the target +request duration). Let's say your target request duration is 300ms and the +tolerable request duration is 1.2s. If you want to calculate the Apdex score by +`job` over the last 5m, the PromQL expression for native histograms (including +NHCB) is straightforward. Simply add the fraction of requests within your +duration target to half of the fraction of requests with a duration between the +target and the tolerated duration: + + histogram_fraction(0, 0.3, sum by (job) (rate(http_request_duration_seconds[5m]))) + + + histogram_fraction(0.3, 1.2, sum by (job) (rate(http_request_duration_seconds[5m]))) / 2 + +In the “pure classic” case, you _must_ have buckets present at the exact +boundaries (giving you an accurace calculation in return). The corresponding +PromQL expression looks quite different because the classic buckets are +cumulative: + + ( + sum by (job) (rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) + + + sum by (job) (rate(http_request_duration_seconds_bucket{le="1.2"}[5m])) + ) + / + 2 + / + sum by (job) (rate(http_request_duration_seconds_count[5m])) -The calculation does not exactly match the traditional Apdex score, as it -includes errors in the satisfied and tolerable parts of the calculation. +(For the sake of simplicity, the above expressions do not explicitly exclude +failed requests from the satisfied and tolerated parts of the calculation, as +it would be required for a strictly correct Apdex calculation.) ## Quantiles You can use both summaries and histograms to calculate so-called φ-quantiles, where 0 ≤ φ ≤ 1. The φ-quantile is the observation value that ranks at number φ*N among the N observations. Examples for φ-quantiles: The 0.5-quantile is -known as the median. The 0.95-quantile is the 95th percentile. +known as the median. The 0.95-quantile is also called the 95th percentile. The essential difference between summaries and histograms is that summaries -calculate streaming φ-quantiles on the client side and expose them directly, -while histograms expose bucketed observation counts and the calculation of -quantiles from the buckets of a histogram happens on the server side using the -[`histogram_quantile()` +calculate streaming φ-quantiles within the instrumented program and expose them +directly, while histograms expose bucketed observation counts and the +calculation of quantiles from the buckets of a histogram happens on the +Prometheus server using the [`histogram_quantile()` function](/docs/prometheus/latest/querying/functions/#histogram_quantile). - -The two approaches have a number of different implications: - -| | Histogram | Summary -|---|-----------|--------- -| Required configuration | Pick buckets suitable for the expected range of observed values. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later. -| Client performance | Observations are very cheap as they only need to increment counters. | Observations are expensive due to the streaming quantile calculation. -| Server performance | The server has to calculate quantiles. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Low server-side cost. -| Number of time series (in addition to the `_sum` and `_count` series) | One time series per configured bucket. | One time series per configured quantile. -| Quantile error (see below for details) | Error is limited in the dimension of observed values by the width of the relevant bucket. | Error is limited in the dimension of φ by a configurable value. -| Specification of φ-quantile and sliding time-window | Ad-hoc with [Prometheus expressions](/docs/prometheus/latest/querying/functions/#histogram_quantile). | Preconfigured by the client. -| Aggregation | Ad-hoc with [Prometheus expressions](/docs/prometheus/latest/querying/functions/#histogram_quantile). | In general [not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html). - -Note the importance of the last item in the table. Let us return to -the SLO of serving 95% of requests within 300ms. This time, you do not -want to display the percentage of requests served within 300ms, but -instead the 95th percentile, i.e. the request duration within which -you have served 95% of requests. To do that, you can either configure -a summary with a 0.95-quantile and (for example) a 5-minute decay -time, or you configure a histogram with a few buckets around the 300ms -mark, e.g. `{le="0.1"}`, `{le="0.2"}`, `{le="0.3"}`, and -`{le="0.45"}`. If your service runs replicated with a number of -instances, you will collect request durations from every single one of -them, and then you want to aggregate everything into an overall 95th -percentile. However, aggregating the precomputed quantiles from a -summary rarely makes sense. In this particular case, averaging the -quantiles yields statistically nonsensical values. +Histograms are further divided into native and classic histograms. The +following table lists some implications of the different approaches. + +| | Native Histogram | Classic Histogram | Summary +|---|------------------|-------------------|--------- +| Required configuration during instrumentation | Pick a desired resolution and maybe a strategy to limit the bucket count. | Pick buckets suitable for the expected range of observed values and the desired queries. | Pick desired φ-quantiles and sliding window. Other φ-quantiles and sliding windows cannot be calculated later. +| Instrumentation cost | Observations are cheap as they only need to increment counters. | Observations are cheap as they only need to increment counters. | Observations are relatively expensive due to the streaming quantile calculation. +| Query performance | The server has to calculate quantiles from complex histogram samples. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | The server has to calculate quantiles from a large number of bucket series. You can use [recording rules](/docs/prometheus/latest/configuration/recording_rules/#recording-rules) should the ad-hoc calculation take too long (e.g. in a large dashboard). | Fast (no quantile calculations on the server, and aggregations are impossible anyway, see below). +| Number of time series per histogram/summary | One (with a composite sample type). | `_sum`, `_count`, one per configured bucket. | `_sum`, `_count`, one per configured quantile. +| Quantile error (see below for details) | Limited by the configured resolution. | Error is limited by the width of the bucket the quantile is located in. | Configurable, generally very low. +| Specification of φ-quantile and sliding time-window | Ad-hoc with [PromQL expression](/docs/prometheus/latest/querying/functions/#histogram_quantile). | Ad-hoc with [PromQL expression](/docs/prometheus/latest/querying/functions/#histogram_quantile). | Preconfigured during instrumentation. +| Aggregation | Ad-hoc with [PromQL expression](/docs/prometheus/latest/querying/functions/#histogram_quantile), buckets are always compatible. | Ad-hoc with [PromQL expression](/docs/prometheus/latest/querying/functions/#histogram_quantile), provided there are no changes in bucket boundaries. | [Not aggregatable](http://latencytipoftheday.blogspot.de/2014/06/latencytipoftheday-you-cant-average.html). + +As mentioned above, classic histograms can be ingested by the Prometheus server +as a special form of native histograms, called NHCBs (Native Histograms with +Custom Bucket boundaries). Therefore, they share some implications with classic +histograms and some with the usual native histograms. On the instrumentation +side, they behave exactly like classic histograms. (In fact, they are identical +to classic histograms, as NHCBs are only created on the server side when a +classic histogram is ingested as an NHCB.) The query performance and number of +time series is the same as for the usual native histograms, but the quantile +error is the same as with a corresponding classic histogram. NHCBs treat a +change of the bucket layout a bit more gracefully than classic histograms, but +it is still a problematic situation (which is at least flagged as such by an +annotation). + +Note the importance of the last item in the table. Let us return to the SLO of +serving 95% of requests within 300ms. This time, you do not want to display the +percentage of requests served within 300ms, but instead the 95th percentile, +i.e. the request duration within which you have served 95% of requests. To do +that, you can either configure a summary with a 0.95-quantile and (for example) +a 5-minute decay time, or you configure a native histogram with a decent +resolution (for example, with the Go instrumentation library, you could use a +value of 1.1 for the `NativeHistogramBucketFactor`), or you configure a classic +histogram with a few buckets around the 300ms mark, e.g. `{le="0.1"}`, +`{le="0.2"}`, `{le="0.3"}`, and `{le="0.45"}`. If your service runs replicated +with a number of instances, you will collect request durations from every +single one of them, and then you want to aggregate everything into an overall +95th percentile. However, aggregating the precomputed quantiles from a summary +rarely makes sense. In this particular case, averaging the quantiles yields +statistically nonsensical values. avg(http_request_duration_seconds{quantile="0.95"}) // BAD! @@ -140,98 +392,148 @@ Using histograms, the aggregation is perfectly possible with the [`histogram_quantile()` function](/docs/prometheus/latest/querying/functions/#histogram_quantile). - histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) // GOOD. +Native histogram version (including NHCB): + + histogram_quantile(0.95, sum(rate(http_request_duration_seconds[5m]))) // GOOD. + +Classic histogram version: + + histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // GOOD. Furthermore, should your SLO change and you now want to plot the 90th percentile, or you want to take into account the last 10 minutes -instead of the last 5 minutes, you only have to adjust the expression +instead of the last 5 minutes, you only have to adjust the expressions above and you do not need to reconfigure the clients. -## Errors of quantile estimation +### Errors of quantile estimation -Quantiles, whether calculated client-side or server-side, are -estimated. It is important to understand the errors of that +Quantiles, whether calculated by the instrumented binary or on the Prometheus +server, are estimated. It is important to understand the errors of that estimation. -Continuing the histogram example from above, imagine your usual -request durations are almost all very close to 220ms, or in other -words, if you could plot the "true" histogram, you would see a very -sharp spike at 220ms. In the Prometheus histogram metric as configured -above, almost all observations, and therefore also the 95th percentile, -will fall into the bucket labeled `{le="0.3"}`, i.e. the bucket from -200ms to 300ms. The histogram implementation guarantees that the true -95th percentile is somewhere between 200ms and 300ms. To return a -single value (rather than an interval), it applies linear -interpolation, which yields 295ms in this case. The calculated -quantile gives you the impression that you are close to breaching the -SLO, but in reality, the 95th percentile is a tiny bit above 220ms, -a quite comfortable distance to your SLO. +Continuing the histogram example from above, imagine your usual request +durations are almost all very close to 220ms, or in other words, in a histogram +with very high resolution, you would see a very sharp spike at 220ms, and the +“true” 95th percentile is also close to 220ms. + +With the `NativeHistogramBucketFactor` of 1.1 (following the Go instrumentation +example), the bucket this spike would fall into has a lower boundary of +approximately 0.210 and an upper boundary of approximately 0.229. (This +document deliberately avoids to explain the details how these boundaries are +calculated. See the aforementioned [spec](/docs/specs/native_histograms/) for +details.) To keep things simple, let's assume that indeed _all_ request fall +into this bucket. The interpolation logic of `histogram_quantile` will then +estimate the 95th percentile to be 228ms (again glossing over the details of +the calculation here). However, given the bucket boundaries above, the true +value could be anywhere between 210ms and 229ms, depending on the actual +distribution of requests within the bucket. So this is a fairly accurate +estimation, even in the worst case (the true value could be 210ms rather than +220ms vs. the estimated value of 228ms). + +Now let's apply the same to the classic histogram configured as described +above. All observations, and therefore also the 95th percentile, will fall into +the bucket labeled `{le="0.3"}`, i.e. the bucket from 200ms to 300ms. The +interpolation would estimate 295ms in this case, with the guarantee that the +true value is between 200ms and 300ms. Not only is the error margin much +larger, also the estimated value of 295ms is much farther away from the true +value of 220ms than in case of the native histogram, where the estimation was +228ms. Given that the SLO is at 300ms for the 95th percentile, the classic +histogram gives you the impression that you are very close to breaching it, but +in reality you are still doing quite well. Next step in our thought experiment: A change in backend routing adds a fixed amount of 100ms to all request durations. Now the request -duration has its sharp spike at 320ms and almost all observations will -fall into the bucket from 300ms to 450ms. The 95th percentile is -calculated to be 442.5ms, although the correct value is close to -320ms. While you are only a tiny bit outside of your SLO, the -calculated 95th quantile looks much worse. - -A summary would have had no problem calculating the correct percentile -value in both cases, at least if it uses an appropriate algorithm on -the client side (like the [one used by the Go -client](http://dimacs.rutgers.edu/~graham/pubs/slides/bquant-long.pdf)). -Unfortunately, you cannot use a summary if you need to aggregate the +duration has its sharp spike at 320ms. + +The relevant bucket of the native histogram ranges from 297ms to 324ms (again +just stating numbers here without telling you how they are calculated), with +the interpolated estimation for the 95th percentile being 323ms. That's an +almost perfect guess. + +The classic histogram, however, will see almost all observations in the bucket +from 300ms to 450ms. The 95th percentile is estimated to be 443ms, far away +from the correct value close to 320ms. While you are only a tiny bit outside of +your SLO, the estimated 95th quantile looks much worse. + +A summary would have had no problem calculating the correct percentile value +very accurately in both cases, at least if it uses an appropriate algorithm +(like the [one used by the Go instrumentation +library](http://dimacs.rutgers.edu/~graham/pubs/slides/bquant-long.pdf) – this +algorithm will yield very accurate results for narrow distributions as in our +example). Unfortunately, you cannot use a summary if you need to aggregate the observations from a number of instances. -Luckily, due to your appropriate choice of bucket boundaries, even in -this contrived example of very sharp spikes in the distribution of -observed values, the histogram was able to identify correctly if you -were within or outside of your SLO. Also, the closer the actual value -of the quantile is to our SLO (or in other words, the value we are -actually most interested in), the more accurate the calculated value +Luckily, due to your appropriate choice of bucket boundaries for the clasic +histogram, in this contrived example of very sharp spikes in the distribution +of observed values, the classic histogram was able to identify correctly if you +were within or outside of your SLO (although it was bad in telling you how far +away you were from breaching or keeping the SLO). However, the closer the +actual value of the quantile is to the SLO (or in other words, the value you +are actually most interested in), the more accurate the calculated value becomes. -Let us now modify the experiment once more. In the new setup, the -distributions of request durations has a spike at 150ms, but it is not -quite as sharp as before and only comprises 90% of the -observations. 10% of the observations are evenly spread out in a long -tail between 150ms and 450ms. With that distribution, the 95th -percentile happens to be exactly at our SLO of 300ms. With the -histogram, the calculated value is accurate, as the value of the 95th -percentile happens to coincide with one of the bucket boundaries. Even -slightly different values would still be accurate as the (contrived) -even distribution within the relevant buckets is exactly what the -linear interpolation within a bucket assumes. - -The error of the quantile reported by a summary gets more interesting -now. The error of the quantile in a summary is configured in the -dimension of φ. In our case we might have configured 0.95±0.01, -i.e. the calculated value will be between the 94th and 96th -percentile. The 94th quantile with the distribution described above is -270ms, the 96th quantile is 330ms. The calculated value of the 95th -percentile reported by the summary can be anywhere in the interval -between 270ms and 330ms, which unfortunately is all the difference -between clearly within the SLO vs. clearly outside the SLO. +Let us now modify the experiment once more. In the new setup, the distributions +of request durations has a spike at 150ms, but it is not quite as sharp as +before and only comprises 90% of the observations. 10% of the observations are +evenly spread out in a long tail between 150ms and 450ms. With that +distribution, the 95th percentile happens to be exactly at our SLO of 300ms. +With the classic histogram, the calculated value would be accurate in this +(contrived) case, as the value of the 95th percentile happens to coincide with +one of the configured bucket boundaries. Even slightly different values would +still be accurate as the even distribution within the relevant buckets is +exactly what the interpolation algorithm for classic histograms assumes. + +The error of the quantile reported by a summary gets more interesting here. In +the case of the Go instrumentation library, the error of the quantile in a +summary is configured in the dimension of φ. In our case we might have +configured 0.95±0.01, i.e. the calculated value will be between the 94th and +96th percentile. The 94th quantile with the distribution described above is +270ms, the 96th quantile is 330ms. The calculated value of the 95th percentile +reported by the summary can be anywhere in the interval between 270ms and +330ms, which unfortunately is all the difference between clearly within the SLO +vs. clearly outside the SLO. The bottom line is: If you use a summary, you control the error in the -dimension of φ. If you use a histogram, you control the error in the -dimension of the observed value (via choosing the appropriate bucket -layout). With a broad distribution, small changes in φ result in -large deviations in the observed value. With a sharp distribution, a -small interval of observed values covers a large interval of φ. - -Two rules of thumb: - - 1. If you need to aggregate, choose histograms. - - 2. Otherwise, choose a histogram if you have an idea of the range - and distribution of values that will be observed. Choose a - summary if you need an accurate quantile, no matter what the - range and distribution of the values is. - - -## What can I do if my client library does not support the metric type I need? +dimension of φ. If you use a histogram, you control the error in the dimension +of the observed value, via choosing the appropriate bucket layout in case of +the classic histogram (tough) or via choosing a bucket resolution in case of a +native histogram (easy). With a broad distribution, small changes in φ result +in large deviations in the observed value. With a sharp distribution, a small +interval of observed values covers a large interval of φ. + +The rules of thumb are the following: + + 1. If you have access to native histograms, use them with a resolution that + matches your accuracy requirements. This combines the required accuracy + with the ability to aggregate and to change parameters (percentile, + sliding window) ad hoc via the PromQL expression. + 2. If you cannot use native histograms, but you need aggregations, you have + to use classic histograms, which requires you to set appropriate bucket + boundaries, covering the correct range of values and finding the right + trade-off between the cost for the buckets and the required accuracy. + 3. Only if aggregation isn't needed, you can start thinking about summaries. + The main advantage is that it gives you very accurate quantile estimation + (in the dimension of φ) at relatively low overall cost. However, the + additional requirement to pick the desired quantiles and sliding window at + instrumentation time is another severe drawback of summaries. + +## Visualization + +While the pre-calculated quantiles of a summary can be visualized as any other +time series of floats, visualizing a histogram is more complex. The Prometheus +UI shows a graphical representation of a single histogram sample in the _Table_ +view. However, in the _Graph_ view, it simply plots each component series of a +classic histogram or – in case of a native histogram – only the sum of +observations. + +A very useful visualization of a histogram over time is a heatmap. The +Prometheus UI does not support heatmaps yet (see [tracking +issue](https://github.com/prometheus/prometheus/issues/15346)). However, +popular dashboarding tools like +[Perses](https://perses.dev/plugins/docs/heatmapchart/) or +[Grafana](https://grafana.com/docs/grafana/latest/visualizations/panels-visualizations/visualizations/heatmap/) +are able to render heatmaps based on Prometheus histograms. The resolution of +classic histograms is usually not high enough to create compelling heatmaps, +but the higher resolution reachable with native histograms results in very +detailed heatmaps. -Implement it! [Code contributions are welcome](/community/). In general, we -expect histograms to be more urgently needed than summaries. Histograms are -also easier to implement in a client library, so we recommend to implement -histograms first, if in doubt. From 6ec3723ec12bc6955ae3f97bb2e296d54d401b8c Mon Sep 17 00:00:00 2001 From: beorn7 Date: Tue, 10 Mar 2026 19:10:12 +0100 Subject: [PATCH 2/2] Update metric_types.md for native histograms MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Note that I used this opportunity to replace the term "client library" with "instrumentation library". I always thought that "client library" is confusing as it is not implementing a client in any way. (Technically, it implements a _server_, of which the Prometheus "server" is the client… :exploding_head:) Even if we accept that "Prometheus client library" just means "a library to do something that has to do with Prometheus", the title "client library" still doesn't tell us what the library is actually for. (Note that the client_golang repository not only contains an instrumentation library, but also includes an _actual_ client library that helps you to implement clients that talk to the Prometheus HTTP API.) Signed-off-by: beorn7 --- docs/concepts/metric_types.md | 113 ++++++++++++++++++++++++---------- docs/practices/histograms.md | 6 +- 2 files changed, 83 insertions(+), 36 deletions(-) diff --git a/docs/concepts/metric_types.md b/docs/concepts/metric_types.md index 3b37d7f07..3b9e263b5 100644 --- a/docs/concepts/metric_types.md +++ b/docs/concepts/metric_types.md @@ -3,11 +3,17 @@ title: Metric types sort_rank: 2 --- -The Prometheus client libraries offer four core metric types. These are -currently only differentiated in the client libraries (to enable APIs tailored -to the usage of the specific types) and in the wire protocol. The Prometheus -server does not yet make use of the type information and flattens all data into -untyped time series. This may change in the future. +The Prometheus instrumentation libraries offer four core metric types. With the +exception of native histograms, these are currently only differentiated in the +instrumentation libraries (to enable APIs tailored to the usage of the specific +types) and in the exposition protocols. The Prometheus server does not yet make +use of the type information and flattens all types except native histograms +into untyped time series of floating point values. Native histograms, however, +are ingested as time series of special composite histogram samples. In the +future, Prometheus might handle other metric types as [composite +types](/blog/2026/02/14/modernizing-prometheus-composite-samples/), too. There +is also ongoing work to persist the type information of the simple float +samples. ## Counter @@ -20,7 +26,7 @@ errors. Do not use a counter to expose a value that can decrease. For example, do not use a counter for the number of currently running processes; instead use a gauge. -Client library usage documentation for counters: +Instrumentation library usage documentation for counters: * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Counter) * [Java](https://prometheus.github.io/client_java/getting-started/metric-types/#counter) @@ -38,7 +44,7 @@ Gauges are typically used for measured values like temperatures or current memory usage, but also "counts" that can go up and down, like the number of concurrent requests. -Client library usage documentation for gauges: +Instrumentation library usage documentation for gauges: * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Gauge) * [Java](https://prometheus.github.io/client_java/getting-started/metric-types/#gauge) @@ -51,37 +57,78 @@ Client library usage documentation for gauges: A _histogram_ samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum -of all observed values. - -A histogram with a base metric name of `` exposes multiple time series -during a scrape: - - * cumulative counters for the observation buckets, exposed as `_bucket{le=""}` +of all observed values. As such, a histogram is essentially a bucketed counter. +However, a histogram can also represent the current state of a distribution, in +which case it is called a _gauge histogram_. In contrast to the usual +counter-like histograms, gauge histograms are rarely directly exposed by +instrumented programs and are thus not (yet) usable in instrumentation +libraries, but they are represented in newer versions of the protobuf +exposition format and in [OpenMetrics](https://openmetrics.io/). They are also +created regularly by PromQL expressions. For example, the outcome of applying +the `rate` function to a counter histogram is a gauge histogram, in the same +way as the outcome of applying the `rate` function to a counter is a gauge. + +Histograms exists in two fundamentally different versions: The more recent +_native histograms_ and the older _classic histograms_. + +A native histogram is exposed and ingested as composite samples, where each +sample represents the count and sum of observations together with a dynamic set +of buckets. + +A classic histogram, however, consists of multiple time series of simple float +samples. A classic histogram with a base metric name of `` results in +the following time series: + + * cumulative counters for the observation buckets, exposed as + `_bucket{le=""}` * the **total sum** of all observed values, exposed as `_sum` - * the **count** of events that have been observed, exposed as `_count` (identical to `_bucket{le="+Inf"}` above) - -Use the -[`histogram_quantile()` function](/docs/prometheus/latest/querying/functions/#histogram_quantile) -to calculate quantiles from histograms or even aggregations of histograms. A -histogram is also suitable to calculate an -[Apdex score](http://en.wikipedia.org/wiki/Apdex). When operating on buckets, -remember that the histogram is -[cumulative](https://en.wikipedia.org/wiki/Histogram#Cumulative_histogram). See -[histograms and summaries](/docs/practices/histograms) for details of histogram -usage and differences to [summaries](#summary). - -NOTE: Beginning with Prometheus v2.40, there is experimental support for native -histograms. A native histogram requires only one time series, which includes a -dynamic number of buckets in addition to the sum and count of -observations. Native histograms allow much higher resolution at a fraction of -the cost. Detailed documentation will follow once native histograms are closer -to becoming a stable feature. + * the **count** of events that have been observed, exposed as + `_count` (identical to `_bucket{le="+Inf"}` above) + +Native histograms are generally much more efficient than classic histograms, +allow much higher resolution, and do not require explicit configuration of +bucket boundaries during instrumentation. Their bucketing schema ensures that +they are always aggregatable with each other, even if the resolution might have +changed, while classic histograms with different bucket boundaries are not +generally aggregatable. If the instrumentation library you are using supports native +histograms (currently this is the case for Go and Java), you should probably +prefer native histograms over classic histograms. + +If you are stuck with classic histograms for whatever reason, there is a way to +get at least some of the benefits of native histograms: You can configure +Prometheus to ingest classic histograms into a special form of native +histograms, called Native Histograms with Custom Bucket boundaries (NHCB). +NHCBs are stored as the same composite samples as usual native histograms with +the same gain in efficiency. However, their buckets are still the same buckets +statically configured during instrumentation, with their limited resolution and +range and the same problems of aggregatability upon changing the bucket +boundaries. + +Use the [`histogram_quantile()` +function](/docs/prometheus/latest/querying/functions/#histogram_quantile) to +calculate quantiles from histograms or even aggregations of histograms. It +works for both classic and native histograms, using a slightly different +syntax. Histograms are also suitable to calculate an [Apdex +score](http://en.wikipedia.org/wiki/Apdex). + +You can operate directly on the buckets of a classic histogram, as they are +represented as individual series (called `_bucket{le=""}` as described above). Remember, however, that these buckets +are [cumulative](https://en.wikipedia.org/wiki/Histogram#Cumulative_histogram), +i.e. every bucket counts all observations less than or equal to the upper +boundary provided as a label. With native histograms, use the +[`histogram_fraction()` +function](/docs/prometheus/latest/querying/functions/#histogram_fraction) to +calculate fractions of observations within given boundaries. + +See [histograms and summaries](/docs/practices/histograms) for details of +histogram usage and differences to [summaries](#summary). NOTE: Beginning with Prometheus v3.0, the values of the `le` label of classic histograms are normalized during ingestion to follow the format of [OpenMetrics Canonical Numbers](https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md#considerations-canonical-numbers). -Client library usage documentation for histograms: +Instrumentation library usage documentation for histograms: * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Histogram) * [Java](https://prometheus.github.io/client_java/getting-started/metric-types/#histogram) @@ -111,7 +158,7 @@ to [histograms](#histogram). NOTE: Beginning with Prometheus v3.0, the values of the `quantile` label are normalized during ingestion to follow the format of [OpenMetrics Canonical Numbers](https://github.com/prometheus/OpenMetrics/blob/main/specification/OpenMetrics.md#considerations-canonical-numbers). -Client library usage documentation for summaries: +Instrumentation library usage documentation for summaries: * [Go](http://godoc.org/github.com/prometheus/client_golang/prometheus#Summary) * [Java](https://prometheus.github.io/client_java/getting-started/metric-types/#summary) diff --git a/docs/practices/histograms.md b/docs/practices/histograms.md index 0974766c2..fc9ca118d 100644 --- a/docs/practices/histograms.md +++ b/docs/practices/histograms.md @@ -401,9 +401,9 @@ Classic histogram version: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) // GOOD. Furthermore, should your SLO change and you now want to plot the 90th -percentile, or you want to take into account the last 10 minutes -instead of the last 5 minutes, you only have to adjust the expressions -above and you do not need to reconfigure the clients. +percentile, or you want to take into account the last 10 minutes instead of the +last 5 minutes, you only have to adjust the expressions above and you do not +need to reconfigure the instrumentation of the monitored programs. ### Errors of quantile estimation