From 9fde4c0b2ad94963cf40423dc121bb0653af4cb7 Mon Sep 17 00:00:00 2001
From: Tatiana Inama
Date: Mon, 30 Mar 2026 18:34:12 +0200
Subject: [PATCH] docs: add pre-aggregates example and expand limitations section

---
 references/pre-aggregates/overview.mdx | 58 ++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 7 deletions(-)

diff --git a/references/pre-aggregates/overview.mdx b/references/pre-aggregates/overview.mdx
index edc3b027..4e0f1ce6 100644
--- a/references/pre-aggregates/overview.mdx
+++ b/references/pre-aggregates/overview.mdx
@@ -8,7 +8,7 @@ description: "Speed up dashboards and reduce warehouse costs by serving queries
 **Availability:** Pre-aggregates are an [Early Access](/references/workspace/feature-maturity-levels) feature available on **Enterprise plans** only.
 
-Pre-aggregates let you define materialized summaries of your data directly in your dbt YAML. When a user runs a query in Lightdash, the system checks if the query can be answered from a pre-aggregate instead of querying your warehouse. If it matches, the query is served from the pre-computed results — making it significantly faster and reducing warehouse load.
+Pre-aggregates let you define materialized summaries of your data directly in your dbt YAML. When a user runs a query in Lightdash, the system checks if the query can be answered from a pre-aggregate instead of querying your warehouse. If it matches, the query is served from the pre-computed results, making it significantly faster and reducing warehouse load.
 
 This is especially useful for dashboards with high traffic or expensive aggregations that don't need real-time data.
 
@@ -30,7 +30,38 @@ Pre-aggregates follow a four-step cycle:
 3. **Match** — When a user runs a query, Lightdash checks if every requested dimension, metric, and filter is covered by a pre-aggregate.
 4. **Serve** — If a match is found, the query is served from the materialized data instead of hitting your warehouse.
-{/* TODO: Add architecture diagram here showing the define → materialize → match → serve cycle */}
+### Example
+
+Suppose you have an `orders` table with thousands of rows, and you define a pre-aggregate with dimensions `status` and metrics `total_amount` (sum) and `order_count` (count), with a `day` granularity on `order_date`.
+
+**Your warehouse data:**
+
+| order_date | status | customer | amount |
+|---|---|---|---|
+| 2024-01-15 | shipped | Alice | $100 |
+| 2024-01-15 | shipped | Bob | $50 |
+| 2024-01-15 | pending | Charlie | $75 |
+| 2024-01-16 | shipped | Alice | $200 |
+| 2024-01-16 | pending | Charlie | $30 |
+| ... | ... | ... | ... |
+
+**Lightdash materializes this into a pre-aggregate:**
+
+| order_date_day | status | total_amount | order_count |
+|---|---|---|---|
+| 2024-01-15 | shipped | $150 | 2 |
+| 2024-01-15 | pending | $75 | 1 |
+| 2024-01-16 | shipped | $200 | 1 |
+| 2024-01-16 | pending | $30 | 1 |
+
+Now when a user queries "total amount by status, grouped by **month**", Lightdash re-aggregates from the daily pre-aggregate instead of scanning the full table:
+
+| order_date_month | status | total_amount |
+|---|---|---|
+| January 2024 | shipped | $350 |
+| January 2024 | pending | $105 |
+
+This works because `sum` can be re-aggregated — summing daily sums gives the correct monthly sum.
 ## Query matching
 
@@ -41,7 +72,7 @@ When a user runs a query, Lightdash automatically checks if a pre-aggregate can
 - Every dimension used in **filters** is included in the pre-aggregate
 - All metrics use [supported metric types](#supported-metric-types)
 - The query does not contain custom dimensions, custom metrics, or table calculations
-- If the query uses a time dimension, the requested granularity is **equal to or coarser** than the pre-aggregate's granularity (for example, a `day` pre-aggregate can serve `day`, `week`, `month`, or `year` queries — but not `hour`)
+- If the query uses a time dimension, the requested granularity is **equal to or coarser** than the pre-aggregate's granularity (for example, a `day` pre-aggregate can serve `day`, `week`, `month`, or `year` queries, but not `hour`)
 
 When multiple pre-aggregates match a query, Lightdash picks the smallest one (fewest dimensions, then fewest metrics as tiebreaker).
 
@@ -59,13 +90,26 @@ Pre-aggregates support metrics that can be re-aggregated from pre-computed resul
 - `max`
 - `average`
 
-### Unsupported metric types
+### Current limitations
+
+Not all metrics work this way. Consider `count_distinct` with the same daily pre-aggregate from above. If a daily pre-aggregate stores "2 distinct customers on 2024-01-15" and "1 distinct customer on 2024-01-16", you can't sum those to get the monthly distinct count — Alice ordered on both days and would be counted twice:
+
+| order_date_day | status | distinct_customers |
+|---|---|---|
+| 2024-01-15 | shipped | 2 (Alice, Bob) |
+| 2024-01-16 | shipped | 1 (Alice) |
 
-Queries that include any of the following metric types will **not** match a pre-aggregate and will query the warehouse directly:
+Re-aggregating: 2 + 1 = **3**, but the correct monthly answer is **2** (Alice, Bob). The pre-aggregate lost track of *which* customers were counted.
-- `count_distinct`, `sum_distinct`, `average_distinct`
+We're investigating supporting `count_distinct` through approximation algorithms. [Follow this issue](https://github.com/lightdash/lightdash/issues/21536) for updates.
+
+For similar reasons, the following metric types are also not supported:
+
+- `sum_distinct`, `average_distinct`
 - `median`, `percentile`
 - `percent_of_total`, `percent_of_previous`
 - `running_total`
+- Custom SQL metrics — [Follow this issue](https://github.com/lightdash/lightdash/issues/21537)
 - `number`, `string`, `date`, `timestamp`, `boolean`
-- Metrics with custom SQL expressions
+
+For metrics that can't be pre-aggregated, consider using [caching](/guides/developer/caching) instead.
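
The re-aggregation argument made in the added `### Example` and `### Current limitations` text can be checked with a small sketch. The Python below is an illustration only and is not Lightdash code: the rows are copied from the sample tables in the patch, and every function name (`daily_pre_aggregate`, `monthly_from_daily`, `monthly_distinct_from_raw`) is hypothetical. It assumes a daily rollup stores only the aggregated values, so the set of customers behind each distinct count is no longer available at query time.

```python
from collections import defaultdict

# Sample rows copied from the patch's "warehouse data" table:
# (order_date, status, customer, amount)
ORDERS = [
    ("2024-01-15", "shipped", "Alice", 100),
    ("2024-01-15", "shipped", "Bob", 50),
    ("2024-01-15", "pending", "Charlie", 75),
    ("2024-01-16", "shipped", "Alice", 200),
    ("2024-01-16", "pending", "Charlie", 30),
]


def daily_pre_aggregate(rows):
    """Hypothetical stand-in for materialization: group by (day, status)."""
    groups = defaultdict(
        lambda: {"total_amount": 0, "order_count": 0, "customers": set()}
    )
    for day, status, customer, amount in rows:
        group = groups[(day, status)]
        group["total_amount"] += amount
        group["order_count"] += 1
        group["customers"].add(customer)
    # Keep only aggregated values; the customer sets are dropped here,
    # which is the assumption that makes count_distinct non-re-aggregatable.
    return {
        key: {
            "total_amount": group["total_amount"],
            "order_count": group["order_count"],
            "distinct_customers": len(group["customers"]),
        }
        for key, group in groups.items()
    }


def monthly_from_daily(daily, metric):
    """Re-aggregate a daily metric to month granularity by summing daily values."""
    monthly = defaultdict(int)
    for (day, status), values in daily.items():
        monthly[(day[:7], status)] += values[metric]
    return dict(monthly)


def monthly_distinct_from_raw(rows):
    """Ground truth for count_distinct, computed from the raw rows."""
    seen = defaultdict(set)
    for day, status, customer, _amount in rows:
        seen[(day[:7], status)].add(customer)
    return {key: len(customers) for key, customers in seen.items()}


daily = daily_pre_aggregate(ORDERS)
monthly_sum = monthly_from_daily(daily, "total_amount")
naive_distinct = monthly_from_daily(daily, "distinct_customers")
true_distinct = monthly_distinct_from_raw(ORDERS)

print(monthly_sum[("2024-01", "shipped")])     # 350: summing daily sums is exact
print(naive_distinct[("2024-01", "shipped")])  # 3: Alice is counted on both days
print(true_distinct[("2024-01", "shipped")])   # 2: Alice and Bob
```

The sum survives the rollup because addition is associative over the daily groups, while the distinct count does not because the daily values no longer say which customers overlap between days.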