diff --git a/docs/query-acceleration/hints/hints-overview.md b/docs/query-acceleration/hints/hints-overview.md index fcf8551beda34..aa309ca34db93 100644 --- a/docs/query-acceleration/hints/hints-overview.md +++ b/docs/query-acceleration/hints/hints-overview.md @@ -13,11 +13,12 @@ Currently, Doris possesses excellent out-of-the-box capabilities. In most scenar ## Hint Classification -Doris currently supports several types of hints, including leading hint, ordered hint, and distribute hint: +Doris currently supports several types of hints, including leading hint, ordered hint, distribute hint, and skew hint: -- [Leading Hint](leading-hint.md):Specifies the join order according to the order provided in the leading hint. -- [Ordered Hint](leading-hint.md):A specific type of leading hint that specifies the join order as the original text sequence. -- [Distribute Hint](distribute-hint.md):Specifies the data distribution method for joins as either shuffle or broadcast. +- [Leading Hint](leading-hint.md):Specifies the join order according to the order provided in the leading hint. +- [Ordered Hint](leading-hint.md):A specific type of leading hint that specifies the join order as the original text sequence. +- [Distribute Hint](distribute-hint.md):Specifies the data distribution method for joins as either shuffle or broadcast. +- [Skew Hint](skew-hint.md): Mitigates data skew in joins, distinct aggregations, and window functions. ## Hint Example Imagine a table with a large amount of data. In certain specific cases, you may know that the join order of the tables can affect query performance. In such situations, the Leading Hint allows you to specify the table join order you want the optimizer to follow. @@ -85,6 +86,6 @@ Users can view the effectiveness and reasons for non-effectiveness through the H ## Summary -Hints are powerful tools for manually managing execution plans. Currently, Doris supports leading hint, ordered hint, distribute hint, etc., enabling users to manually manage join order, shuffle methods, and other variable configurations, providing users with more convenient and effective operational capabilities. +Hints are powerful tools for manually managing execution plans. Currently, Doris supports leading hint, ordered hint, distribute hint, skew hint, etc., enabling users to manually manage join order, shuffle methods, and skew-related execution strategies. diff --git a/docs/query-acceleration/hints/skew-hint.md b/docs/query-acceleration/hints/skew-hint.md new file mode 100644 index 0000000000000..b072f880c1940 --- /dev/null +++ b/docs/query-acceleration/hints/skew-hint.md @@ -0,0 +1,316 @@ +--- +{ + "title": "Skew Hint", + "language": "en", + "description": "Skew Hint is used to mitigate data skew in query execution." +} +--- + +## Overview + +Skew Hint is used to mitigate data skew in query execution. + +## Join Skew Hint + +### Overview + +`SaltJoin` is used to mitigate data skew in join scenarios. When join keys contain known hot values, the optimizer introduces a salt column to spread hot-key rows across multiple parallel instances, preventing a single instance from becoming the bottleneck. + +The primary goal of this rewrite is to reduce local overload risk caused by hot keys in `Shuffle Join` scenarios and improve overall execution stability. + +### Applicable Scenarios + +1. Obvious one-sided skew: one side of the join has highly concentrated hot keys. + +2. Known skewed values: you can explicitly provide skewed value lists through hints. + +3. `Shuffle Join` is required: the other table is too large for `Broadcast Join`. + +### Supported Join Types + +- `INNER JOIN` +- `LEFT JOIN` +- `RIGHT JOIN` + +### Usage + +#### Method 1: comment hint + +```sql +SELECT /*+ leading(tl shuffle[skew(tl.a(1,2))] tr) */ * +FROM tl +INNER JOIN tr ON tl.a = tr.a; +``` + +#### Method 2: join hint syntax + +```sql +SELECT * +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a; +``` + +Parameter notes: + +- `tl`: alias of the left table. +- `tr`: alias of the right table. +- `tl.a`: skewed column. +- `(1,2)`: list of known skewed values. + +Example: + +Create test tables and insert data: + +```sql +-- Create left table tl +CREATE TABLE IF NOT EXISTS tl ( + id INT, + a INT, + name STRING, + value DOUBLE +) USING parquet; + +-- Create right table tr +CREATE TABLE IF NOT EXISTS tr ( + id INT, + a INT, + description STRING, + amount DOUBLE +) USING parquet; + +-- Insert left table data (simulated skew) +INSERT INTO tl VALUES +(1, 1, 'name_1', 100.0), +(2, 1, 'name_2', 200.0), +(3, 1, 'name_3', 300.0), +(4, 1, 'name_4', 400.0), +(5, 2, 'name_5', 500.0), +(6, 2, 'name_6', 600.0), +(7, 2, 'name_7', 700.0), +(8, 3, 'name_8', 800.0), +(9, 4, 'name_9', 900.0), +(10, 5, 'name_10', 1000.0); + +-- Insert right table data +INSERT INTO tr VALUES +(1, 1, 'desc_1', 150.0), +(2, 1, 'desc_2', 250.0), +(3, 2, 'desc_3', 350.0), +(4, 2, 'desc_4', 450.0), +(5, 3, 'desc_5', 550.0), +(6, 4, 'desc_6', 650.0), +(7, 5, 'desc_7', 750.0), +(8, 1, 'desc_8', 850.0), +(9, 2, 'desc_9', 950.0); +``` + +Use salt join to optimize queries: + +Example 1: optimize inner join + +```sql +-- Comment hint syntax +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +INNER JOIN tr ON tl.a = tr.a +WHERE tl.value > 300.0; + +-- Join hint syntax +SELECT + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a +WHERE tl.value > 300.0; +``` + +Example 2: optimize left join + +```sql +-- Mitigate skew on the left table in a left join +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id, + tl.a, + tl.name, + COALESCE(tr.description, 'No Match') as description +FROM tl +LEFT JOIN tr ON tl.a = tr.a +ORDER BY tl.id; +``` + +Example 3: optimize right join + +```sql +-- Mitigate skew on the right table in a right join +SELECT /*+leading(tl shuffle[skew(tr.a(1,2))] tr)*/ + tr.id, + tr.a, + tr.description, + COALESCE(tl.name, 'No Match') as name +FROM tl +RIGHT JOIN tr ON tl.a = tr.a +WHERE tr.amount > 500.0; +``` + +### Optimization Principle + +The core idea is a salting rewrite for hot keys. + +After skew values are specified via `skew(...)`, the optimizer introduces a salt column on the skewed side and rewrites the join condition from `key` to `(key, salt)`. This spreads hot-key rows across parallel instances instead of concentrating them in a single worker. + +To keep join semantics correct, the other side is expanded by the same salt buckets for the corresponding skewed keys, so rows can still match on `(key, salt)`. + +A simplified flow: + +1. Identify and mark hot values. + +2. Add salt on the skewed side to split hot rows. + +3. Expand matching rows on the other side by salt buckets, then join. + +This strategy works best for one-sided skew and can significantly reduce hotspot pressure while improving parallelism and execution stability. + +### Limitations + +`SaltJoin` can only mitigate one-sided hotspots and cannot fully eliminate two-sided skew on the same key. + +With left-side skew as an example, the rewrite randomly salts hot keys on the left side and expands rows on the right side by salt value. The join condition changes from `key` to `(key, salt)`, so the left-side hotspot is distributed. + +However, the right side does not reduce hotspot data; it is duplicated across salt partitions for matching. Therefore, when both sides are highly skewed on the same key, this rewrite can reduce pressure on one side but cannot completely fix hotspots on the other side. + +For example, if both left and right tables each contain 100 rows with `key=1` and the bucket count is 100, the left-side rows are distributed across 100 buckets, while right-side rows are expanded so each bucket still contains those 100 rows. Left-side pressure decreases, but right-side skew remains significant. + +## AGG Skew Hint + +### Overview + +`Count Distinct Skew Rewrite` is used to mitigate NDV skew in `DISTINCT` aggregations. + +A typical case is: `GROUP BY a` has a small number of groups, but one hot group (for example, `a=1`) has an extremely large `DISTINCT b`, causing a single instance to hold a very large dedup hash table and leading to memory pressure and long tails. + +This rewrite uses salting buckets plus multi-stage aggregation to split distinct processing inside hot groups and reduce per-instance load. + +### Applicable Scenarios + +1. Obvious NDV skew in `DISTINCT` aggregations: a few groups have abnormally high cardinality. + +2. Long-tail latency, high memory watermark, or OOM risk with normal multi-stage distinct aggregation. + +3. Query is `GROUP BY`-centric and the target distinct argument can be explicitly marked with `[skew]`. + +### Usage + +```sql +SELECT a, COUNT(DISTINCT [skew] b) +FROM t +GROUP BY a; +``` + +### Supported Functions + +Currently, AGG skew rewrite supports the following aggregate functions: + +- `COUNT` +- `SUM` +- `SUM0` +- `GROUP_CONCAT` + +Only the functions above support AGG skew rewrite. Other aggregate functions fall back to the regular plan. + +### Optimization Principle + +For `SELECT a, COUNT(DISTINCT [skew] b) FROM t GROUP BY a`, the flow is: + +1. Apply local deduplication first to reduce raw data volume. + +2. Compute a bucket column for the distinct argument (for example `saltExpr = xxhash_32(b) % bucket_num`). + +3. Distribute by `(a, saltExpr)` and run `multi_distinct_count`. + +4. Aggregate by `a` again and merge bucket results to produce final `COUNT(DISTINCT b)`. + +The key benefit is that hot groups are no longer handled by one large dedup hash structure; they are split into buckets and processed in parallel. + +### Limitations + +`Count Distinct Skew Rewrite` is condition-based. If conditions are not met, the optimizer falls back to the normal aggregation plan. Common limitations include: + +1. `GROUP BY` is required (pure global aggregation does not trigger it). + +2. The target must be a single-argument `DISTINCT` aggregation and marked with `[skew]`. + +3. If the same level has more complex multi-aggregation combinations, the optimizer may skip this rewrite. + +4. If the distinct argument is already included in `GROUP BY`, this rewrite usually provides no benefit and will not trigger. + +### Recommendations + +1. Prioritize `[skew]` for `DISTINCT` aggregations with clear hotspots. + +2. Tune `skew_rewrite_agg_bucket_num` based on data scale to avoid too few buckets (insufficient split) or too many buckets (higher scheduling and merge overhead). + +3. Compare `EXPLAIN`/`PROFILE` before and after optimization to verify reductions in long-tail latency and memory peak. + +## Window Skew Hint + +### Overview + +`Window Skew Rewrite` mitigates sort long-tail issues in window functions when `PARTITION BY` keys are skewed. + +When some partition keys (such as user ID or organization ID) are highly concentrated, conventional execution accumulates large sort and window workloads on a few instances, and the slowest instance dominates total latency. + +### Applicable Scenarios + +1. Obvious hotspots on window `PARTITION BY` keys. + +2. Window queries with `ORDER BY` where sorting is the main bottleneck. + +3. Multiple window expressions in one SQL statement, sharing all or part of the partition keys. + +### Usage + +Mark `[skew]` directly in the `PARTITION BY` clause: + +```sql +SELECT + SUM(a) OVER( + PARTITION BY [skew] b + ORDER BY d + ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING + ) AS w1 +FROM test_skew_window; +``` + +### Optimization Principle + +The core idea is to split heavy sorting into two stages: + +1. Perform local sort upstream. + +2. Shuffle by `PARTITION BY` keys. + +3. Run merge sort downstream, then compute window functions. + +Compared with "shuffle then full sort", this approach is usually more stable under skew: hotspot partitions still need processing, but sorting shifts from full re-sorting to merging pre-sorted streams. + +### Limitations + +1. `[skew]` is a partition-key-level hint and mainly targets `PARTITION BY` skew. + +2. This optimization focuses on sorting overhead and does not change window semantics; extremely large single partitions can still cause long tails. + +3. Within the same partition-key group, only lower window nodes that can shuffle apply this strategy; upper nodes reuse existing distribution and order. + +4. Without `PARTITION BY`, window execution cannot use partition-level parallelism to mitigate skew. + +### Recommendations + +1. Prioritize `[skew]` on partition keys with obvious hotspots. + +2. Use `PROFILE` to observe sort-node time, skew metrics, and long-tail instance changes. diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/hints-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/hints-overview.md index e9d24f7a2896f..8697662be931b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/hints-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/hints-overview.md @@ -13,11 +13,12 @@ ## Hint 分类 -Doris 目前支持以下几种 hint 类型,包括 leading hint,ordered hint,distribute hint 等: +Doris 目前支持以下几种 hint 类型,包括 leading hint,ordered hint,distribute hint,skew hint 等: - [Leading Hint](leading-hint.md):用于指定 join order 为 leading 中提供的 order 顺序; - [Ordered Hint](leading-hint.md):一种特定的 leading hint, 用于指定 join order 为原始文本序; - [Distribute Hint](distribute-hint.md):用于指定 join 的数据分发方式为 shuffle 还是 broadcast。 +- [Skew Hint](skew-hint.md):用于缓解 Join、DISTINCT 聚合和窗口函数中的数据倾斜问题。 ## Hint 示例 @@ -86,4 +87,4 @@ Hint Log 分为三个状态: ## 总结 -Hint 是手动管理执行计划的强大工具。当前 Doris 支持的 leading hint, ordered hint, distribute hint 等,可以支撑用户手动管理 join order, shuffle 方式以及其他变量配置,给用户提供更方便有效的运维能力。 +Hint 是手动管理执行计划的强大工具。当前 Doris 支持的 leading hint, ordered hint, distribute hint, skew hint 等,可以支撑用户手动管理 join order、shuffle 方式以及倾斜场景下的执行策略,给用户提供更方便有效的运维能力。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/skew-hint.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/skew-hint.md new file mode 100644 index 0000000000000..3349e6fc64570 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/query-acceleration/hints/skew-hint.md @@ -0,0 +1,318 @@ +--- +{ + "title": "Skew Hint", + "language": "zh-CN", + "description": "Skew Hint能够解决特定场景的数据倾斜的问题,本文介绍Skew Hint的适用场景和用法" +} +--- + +## Join Skew Hint + +### 概述 + +`SaltJoin` 用于缓解 Join 场景中的数据倾斜问题。当 Join Key 存在已知热点值时,优化器会引入盐值列(salt column),将热点值打散到多个并行实例执行,避免单个实例成为瓶颈。 + +该规则的核心目标是:在 `Shuffle Join` 场景下,降低热点 Key 导致的局部过载风险,提升整体执行稳定性。 + +### 适用场景 + +1. 单侧倾斜明显:Join 两侧中有一侧的热点 Key 非常集中 + +2. 已知倾斜值:可通过 Hint 显式提供倾斜值列表 + +3. 需要走 Shuffle Join:另一侧表较大,不适合 Broadcast Join + +### 支持的 Join 类型 + +* INNER JOIN + +* LEFT JOIN + +* RIGHT JOIN + + + +### 使用方式 + +#### 方法一:使用注释 Hint + +```sql +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ * +FROM tl +INNER JOIN tr ON tl.a = tr.a; +``` + + + +#### 方法二:使用 Join Hint 语法 + +```sql +SELECT * +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a; +``` + +参数说明: + +* `tl`:左表别名 + +* `tr`:右表别名 + +* `tl.a`:存在倾斜的列 + +* `(1,2)`:已知的倾斜值列表 + + + +示例: + +创建测试表并插入数据: + +```sql +-- 创建左表 tl +CREATE TABLE IF NOT EXISTS tl ( + id INT, + a INT, + name STRING, + value DOUBLE +) PROPERTIES("replication_num"="1"); + +-- 创建右表 tr +CREATE TABLE IF NOT EXISTS tr ( + id INT, + a INT, + description STRING, + amount DOUBLE +) PROPERTIES("replication_num"="1"); + +-- 插入左表数据(模拟数据倾斜) +INSERT INTO tl VALUES +(1, 1, 'name_1', 100.0), -- 倾斜值 1 +(2, 1, 'name_2', 200.0), -- 倾斜值 1 +(3, 1, 'name_3', 300.0), -- 倾斜值 1 +(4, 1, 'name_4', 400.0), -- 倾斜值 1 +(5, 2, 'name_5', 500.0), -- 倾斜值 2 +(6, 2, 'name_6', 600.0), -- 倾斜值 2 +(7, 2, 'name_7', 700.0), -- 倾斜值 2 +(8, 3, 'name_8', 800.0), -- 正常值 +(9, 4, 'name_9', 900.0), -- 正常值 +(10, 5, 'name_10', 1000.0); -- 正常值 + +-- 插入右表数据 +INSERT INTO tr VALUES +(1, 1, 'desc_1', 150.0), -- 对应倾斜值 1 +(2, 1, 'desc_2', 250.0), -- 对应倾斜值 1 +(3, 2, 'desc_3', 350.0), -- 对应倾斜值 2 +(4, 2, 'desc_4', 450.0), -- 对应倾斜值 2 +(5, 3, 'desc_5', 550.0), -- 对应正常值 +(6, 4, 'desc_6', 650.0), -- 对应正常值 +(7, 5, 'desc_7', 750.0); -- 对应正常值 +``` + +使用salt join优化查询: + +示例1:inner join优化 + +```sql +-- 使用 Hint 语法 +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +INNER JOIN tr ON tl.a = tr.a +WHERE tl.value > 300.0; + +-- 使用 Join 提示语法 +SELECT + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a +WHERE tl.value > 300.0; +``` + +示例2:left join优化 + +```sql +-- 优化左连接中左表的倾斜问题 +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id, + tl.a, + tl.name,COALESCE(tr.description, 'No Match') as description +FROM tl +LEFT JOIN tr ON tl.a = tr.a +ORDER BY tl.id; +``` + +示例3:right join优化 + +```sql +-- 优化右连接中右表的倾斜问题 +SELECT /*+leading(tl shuffle[skew(tr.a(1,2))] tr)*/ + tr.id, + tr.a, + tr.description, + COALESCE(tl.name, 'No Match') as name +FROM tl +RIGHT JOIN tr ON tl.a = tr.a +WHERE tr.amount > 500.0; +``` + +### 优化原理 + +Join Skew Hint 的核心是对热点 Key 做“加盐重写(salting rewrite)”。 + +当通过 `skew(...)` 指定已知倾斜值后,优化器会在倾斜侧引入盐值列(salt),把 Join 条件从 `key` 扩展为 `(key, salt)`,将热点数据打散到多个并行实例执行,避免单实例成为热点瓶颈。 + +为了保证 Join 语义正确,另一侧会按相同盐值桶对对应热点 Key 进行扩展,使两侧仍可在 `(key, salt)` 上准确匹配。 + +简化理解为三步: + +1. 识别并标记热点值 + +2. 倾斜侧加盐打散 + +3. 非倾斜侧扩展后再 Join + +该策略最适合“单侧明显倾斜”的场景:可以显著降低热点侧的局部过载风险,提升整体并行度与稳定性。 + + + +### 重要限制 + +`SaltJoin` 只能缓解单侧热点,不能同时消除双侧同 Key 倾斜。 + +以左表倾斜为例,规则会在左表对热点值随机打盐,并在右表按盐值扩行,使 Join 条件从 `key` 扩展为 `(key, salt)`。这样左表热点会被打散到多个实例。 + +但右表并不会减少热点数据,只是为了匹配而被复制到多个盐值分区中。因此,当两侧在同一 Key 上都高度倾斜时,该规则只能降低一侧压力,无法同时解决另一侧的热点。 + +例如:左表有 100 行 `key=1`,右表也有 100 行 `key=1`,盐值桶数为 100。重写后左表 100 行会被分散到 100 个桶;右表会被扩展到每个桶都包含这 100 行。结果是左表负载下降,但右表每个实例仍可能处理大量 `key=1`,右侧倾斜并未被解决。 + + + +## AGG Skew Hint + +### 概述 + +`AGG Skew Hint` 用于缓解 `DISTINCT` 聚合中的 NDV 倾斜问题。 + +典型场景是:`GROUP BY a` 的分组数量不大,但某个热点分组(例如 `a=1`)对应的 `DISTINCT b` 数量特别大,导致单个实例维护超大去重哈希表,出现内存压力和长尾。 + +该规则通过“加桶(salt bucket)+ 多阶段聚合”把热点分组内的去重计算进一步拆散,从而降低单实例负载。 + +### 适用场景 + +1. `DISTINCT` 聚合存在明显 NDV 倾斜:少数分组的去重基数异常高 + +2. 常规多阶段 `DISTINCT` 聚合出现倾斜、或内存高水位。 + +### 使用方式 + +```sql +SELECT a, COUNT(DISTINCT [skew] b) +FROM t +GROUP BY a; +``` + +### 当前支持的函数 + +目前,AGG 改写支持以下聚合函数: + +* `COUNT` + +* `SUM` + +* `SUM0` + +* `GROUP_CONCAT` + +仅上述函数支持 AGG skew rewrite,其他聚合函数会回退常规计划。 + +### 优化原理 + +以 `SELECT a, COUNT(DISTINCT [skew] b) FROM t GROUP BY a` 为例,核心流程可理解为: + +1. 先做一次局部去重,降低原始数据量 + +2. 为 `DISTINCT` 参数计算桶列(例如 `saltExpr = xxhash_32(b) % bucket_num`) + +3. 按 `(a, saltExpr)` 做分布并执行 `multi_distinct_count` + +4. 再按 `a` 聚合并合并各桶结果,得到最终 `COUNT(DISTINCT b)` + +这样做的关键收益是:热点分组不再由单个去重哈希结构“硬扛”,而是按桶拆分后并行处理。 + +### 重要限制 + +加上Hint之后的改写是按触发条件生效的,不满足条件会自动回退到常规聚合计划。常见限制包括: + +1. 必须有 `GROUP BY` + +2. 目标是单参数 `DISTINCT` 聚合 + +3. 当 `DISTINCT` 参数已包含在 `GROUP BY` 中时,该重写通常没有收益,不会触发 + +### 使用建议 + +1. 优先在明显热点的 `DISTINCT` 聚合上使用 `[skew]` + +2. 结合业务数据规模调节 `skew_rewrite_agg_bucket_num`,避免桶数过小(打散不足)或过大(调度与汇总开销上升) + +3. 使用 `EXPLAIN`/`PROFILE` 对比优化前后是否明显降低了长尾实例耗时和内存峰值 + +## Window Skew Hint + +### 概述 + +`Window Skew Hint` 用于缓解窗口函数在 `PARTITION BY` 键倾斜时的排序长尾问题。 + +当某些分区键(如用户 ID、机构 ID)高度集中时,传统窗口执行会在少数实例上堆积大量排序与窗口计算,导致整体查询被最慢实例拖慢。 + +### 适用场景 + +1. 窗口函数 `PARTITION BY` 键存在明显热点 + +2. 查询包含 `ORDER BY` 的窗口计算,排序阶段成为主要瓶颈 + +### 使用方式 + +在 `PARTITION BY` 子句中显式标记 `[skew]`: + +```sql +SELECT + SUM(a) OVER( + PARTITION BY [skew] b + ORDER BY d + ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING + ) AS w1 +FROM test_skew_window; +``` + +### 优化原理 + +核心思路是把“重排序”拆成两段: + +1. 先在上游做局部排序(Local Sort) + +2. 再按 `PARTITION BY` 键做 Shuffle + +3. 在下游执行归并排序(Merge Sort)后进行窗口计算 + +相比“Shuffle 后全量排序”,这种方式在倾斜分区上通常更稳定:同样要处理热点分区的数据,但排序开销从全量重排转为归并已排序数据流。 + +### 重要限制 + +1. `[skew]` 是窗口分区键级别的提示,主要作用于 `PARTITION BY` 倾斜场景 + +2. 该优化重点缓解排序开销,不改变窗口语义;若单个分区本身极大,仍可能出现执行长尾 + + +### 使用建议 + +1. 优先在热点明显的 `PARTITION BY` 键上使用 `[skew]` + +2. 结合 `PROFILE` 观察排序节点耗时、数据倾斜指标和长尾实例变化 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/hints-overview.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/hints-overview.md index 3dc6f3fccfb49..81eff40b3cffd 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/hints-overview.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/hints-overview.md @@ -18,6 +18,7 @@ Doris 目前支持以下几种 hint 类型,包括 leading hint,ordered hint - [Leading Hint](leading-hint.md):用于指定 join order 为 leading 中提供的 order 顺序; - [Ordered Hint](leading-hint.md):一种特定的 leading hint, 用于指定 join order 为原始文本序; - [Distribute Hint](distribute-hint.md):用于指定 join 的数据分发方式为 shuffle 还是 broadcast。 +- [Skew Hint](skew-hint.md):用于提供数据分布的倾斜信息,指导优化器调整计划。 ## Hint 示例 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/skew-hint.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/skew-hint.md new file mode 100644 index 0000000000000..3349e6fc64570 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/query-acceleration/hints/skew-hint.md @@ -0,0 +1,318 @@ +--- +{ + "title": "Skew Hint", + "language": "zh-CN", + "description": "Skew Hint能够解决特定场景的数据倾斜的问题,本文介绍Skew Hint的适用场景和用法" +} +--- + +## Join Skew Hint + +### 概述 + +`SaltJoin` 用于缓解 Join 场景中的数据倾斜问题。当 Join Key 存在已知热点值时,优化器会引入盐值列(salt column),将热点值打散到多个并行实例执行,避免单个实例成为瓶颈。 + +该规则的核心目标是:在 `Shuffle Join` 场景下,降低热点 Key 导致的局部过载风险,提升整体执行稳定性。 + +### 适用场景 + +1. 单侧倾斜明显:Join 两侧中有一侧的热点 Key 非常集中 + +2. 已知倾斜值:可通过 Hint 显式提供倾斜值列表 + +3. 需要走 Shuffle Join:另一侧表较大,不适合 Broadcast Join + +### 支持的 Join 类型 + +* INNER JOIN + +* LEFT JOIN + +* RIGHT JOIN + + + +### 使用方式 + +#### 方法一:使用注释 Hint + +```sql +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ * +FROM tl +INNER JOIN tr ON tl.a = tr.a; +``` + + + +#### 方法二:使用 Join Hint 语法 + +```sql +SELECT * +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a; +``` + +参数说明: + +* `tl`:左表别名 + +* `tr`:右表别名 + +* `tl.a`:存在倾斜的列 + +* `(1,2)`:已知的倾斜值列表 + + + +示例: + +创建测试表并插入数据: + +```sql +-- 创建左表 tl +CREATE TABLE IF NOT EXISTS tl ( + id INT, + a INT, + name STRING, + value DOUBLE +) PROPERTIES("replication_num"="1"); + +-- 创建右表 tr +CREATE TABLE IF NOT EXISTS tr ( + id INT, + a INT, + description STRING, + amount DOUBLE +) PROPERTIES("replication_num"="1"); + +-- 插入左表数据(模拟数据倾斜) +INSERT INTO tl VALUES +(1, 1, 'name_1', 100.0), -- 倾斜值 1 +(2, 1, 'name_2', 200.0), -- 倾斜值 1 +(3, 1, 'name_3', 300.0), -- 倾斜值 1 +(4, 1, 'name_4', 400.0), -- 倾斜值 1 +(5, 2, 'name_5', 500.0), -- 倾斜值 2 +(6, 2, 'name_6', 600.0), -- 倾斜值 2 +(7, 2, 'name_7', 700.0), -- 倾斜值 2 +(8, 3, 'name_8', 800.0), -- 正常值 +(9, 4, 'name_9', 900.0), -- 正常值 +(10, 5, 'name_10', 1000.0); -- 正常值 + +-- 插入右表数据 +INSERT INTO tr VALUES +(1, 1, 'desc_1', 150.0), -- 对应倾斜值 1 +(2, 1, 'desc_2', 250.0), -- 对应倾斜值 1 +(3, 2, 'desc_3', 350.0), -- 对应倾斜值 2 +(4, 2, 'desc_4', 450.0), -- 对应倾斜值 2 +(5, 3, 'desc_5', 550.0), -- 对应正常值 +(6, 4, 'desc_6', 650.0), -- 对应正常值 +(7, 5, 'desc_7', 750.0); -- 对应正常值 +``` + +使用salt join优化查询: + +示例1:inner join优化 + +```sql +-- 使用 Hint 语法 +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +INNER JOIN tr ON tl.a = tr.a +WHERE tl.value > 300.0; + +-- 使用 Join 提示语法 +SELECT + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a +WHERE tl.value > 300.0; +``` + +示例2:left join优化 + +```sql +-- 优化左连接中左表的倾斜问题 +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id, + tl.a, + tl.name,COALESCE(tr.description, 'No Match') as description +FROM tl +LEFT JOIN tr ON tl.a = tr.a +ORDER BY tl.id; +``` + +示例3:right join优化 + +```sql +-- 优化右连接中右表的倾斜问题 +SELECT /*+leading(tl shuffle[skew(tr.a(1,2))] tr)*/ + tr.id, + tr.a, + tr.description, + COALESCE(tl.name, 'No Match') as name +FROM tl +RIGHT JOIN tr ON tl.a = tr.a +WHERE tr.amount > 500.0; +``` + +### 优化原理 + +Join Skew Hint 的核心是对热点 Key 做“加盐重写(salting rewrite)”。 + +当通过 `skew(...)` 指定已知倾斜值后,优化器会在倾斜侧引入盐值列(salt),把 Join 条件从 `key` 扩展为 `(key, salt)`,将热点数据打散到多个并行实例执行,避免单实例成为热点瓶颈。 + +为了保证 Join 语义正确,另一侧会按相同盐值桶对对应热点 Key 进行扩展,使两侧仍可在 `(key, salt)` 上准确匹配。 + +简化理解为三步: + +1. 识别并标记热点值 + +2. 倾斜侧加盐打散 + +3. 非倾斜侧扩展后再 Join + +该策略最适合“单侧明显倾斜”的场景:可以显著降低热点侧的局部过载风险,提升整体并行度与稳定性。 + + + +### 重要限制 + +`SaltJoin` 只能缓解单侧热点,不能同时消除双侧同 Key 倾斜。 + +以左表倾斜为例,规则会在左表对热点值随机打盐,并在右表按盐值扩行,使 Join 条件从 `key` 扩展为 `(key, salt)`。这样左表热点会被打散到多个实例。 + +但右表并不会减少热点数据,只是为了匹配而被复制到多个盐值分区中。因此,当两侧在同一 Key 上都高度倾斜时,该规则只能降低一侧压力,无法同时解决另一侧的热点。 + +例如:左表有 100 行 `key=1`,右表也有 100 行 `key=1`,盐值桶数为 100。重写后左表 100 行会被分散到 100 个桶;右表会被扩展到每个桶都包含这 100 行。结果是左表负载下降,但右表每个实例仍可能处理大量 `key=1`,右侧倾斜并未被解决。 + + + +## AGG Skew Hint + +### 概述 + +`AGG Skew Hint` 用于缓解 `DISTINCT` 聚合中的 NDV 倾斜问题。 + +典型场景是:`GROUP BY a` 的分组数量不大,但某个热点分组(例如 `a=1`)对应的 `DISTINCT b` 数量特别大,导致单个实例维护超大去重哈希表,出现内存压力和长尾。 + +该规则通过“加桶(salt bucket)+ 多阶段聚合”把热点分组内的去重计算进一步拆散,从而降低单实例负载。 + +### 适用场景 + +1. `DISTINCT` 聚合存在明显 NDV 倾斜:少数分组的去重基数异常高 + +2. 常规多阶段 `DISTINCT` 聚合出现倾斜、或内存高水位。 + +### 使用方式 + +```sql +SELECT a, COUNT(DISTINCT [skew] b) +FROM t +GROUP BY a; +``` + +### 当前支持的函数 + +目前,AGG 改写支持以下聚合函数: + +* `COUNT` + +* `SUM` + +* `SUM0` + +* `GROUP_CONCAT` + +仅上述函数支持 AGG skew rewrite,其他聚合函数会回退常规计划。 + +### 优化原理 + +以 `SELECT a, COUNT(DISTINCT [skew] b) FROM t GROUP BY a` 为例,核心流程可理解为: + +1. 先做一次局部去重,降低原始数据量 + +2. 为 `DISTINCT` 参数计算桶列(例如 `saltExpr = xxhash_32(b) % bucket_num`) + +3. 按 `(a, saltExpr)` 做分布并执行 `multi_distinct_count` + +4. 再按 `a` 聚合并合并各桶结果,得到最终 `COUNT(DISTINCT b)` + +这样做的关键收益是:热点分组不再由单个去重哈希结构“硬扛”,而是按桶拆分后并行处理。 + +### 重要限制 + +加上Hint之后的改写是按触发条件生效的,不满足条件会自动回退到常规聚合计划。常见限制包括: + +1. 必须有 `GROUP BY` + +2. 目标是单参数 `DISTINCT` 聚合 + +3. 当 `DISTINCT` 参数已包含在 `GROUP BY` 中时,该重写通常没有收益,不会触发 + +### 使用建议 + +1. 优先在明显热点的 `DISTINCT` 聚合上使用 `[skew]` + +2. 结合业务数据规模调节 `skew_rewrite_agg_bucket_num`,避免桶数过小(打散不足)或过大(调度与汇总开销上升) + +3. 使用 `EXPLAIN`/`PROFILE` 对比优化前后是否明显降低了长尾实例耗时和内存峰值 + +## Window Skew Hint + +### 概述 + +`Window Skew Hint` 用于缓解窗口函数在 `PARTITION BY` 键倾斜时的排序长尾问题。 + +当某些分区键(如用户 ID、机构 ID)高度集中时,传统窗口执行会在少数实例上堆积大量排序与窗口计算,导致整体查询被最慢实例拖慢。 + +### 适用场景 + +1. 窗口函数 `PARTITION BY` 键存在明显热点 + +2. 查询包含 `ORDER BY` 的窗口计算,排序阶段成为主要瓶颈 + +### 使用方式 + +在 `PARTITION BY` 子句中显式标记 `[skew]`: + +```sql +SELECT + SUM(a) OVER( + PARTITION BY [skew] b + ORDER BY d + ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING + ) AS w1 +FROM test_skew_window; +``` + +### 优化原理 + +核心思路是把“重排序”拆成两段: + +1. 先在上游做局部排序(Local Sort) + +2. 再按 `PARTITION BY` 键做 Shuffle + +3. 在下游执行归并排序(Merge Sort)后进行窗口计算 + +相比“Shuffle 后全量排序”,这种方式在倾斜分区上通常更稳定:同样要处理热点分区的数据,但排序开销从全量重排转为归并已排序数据流。 + +### 重要限制 + +1. `[skew]` 是窗口分区键级别的提示,主要作用于 `PARTITION BY` 倾斜场景 + +2. 该优化重点缓解排序开销,不改变窗口语义;若单个分区本身极大,仍可能出现执行长尾 + + +### 使用建议 + +1. 优先在热点明显的 `PARTITION BY` 键上使用 `[skew]` + +2. 结合 `PROFILE` 观察排序节点耗时、数据倾斜指标和长尾实例变化 diff --git a/sidebars.ts b/sidebars.ts index f50f5ab779469..30c2b0cdb766b 100644 --- a/sidebars.ts +++ b/sidebars.ts @@ -673,6 +673,7 @@ const sidebars: SidebarsConfig = { 'query-acceleration/hints/hints-overview', 'query-acceleration/hints/leading-hint', 'query-acceleration/hints/distribute-hint', + 'query-acceleration/hints/skew-hint', ], }, { diff --git a/versioned_docs/version-4.x/query-acceleration/hints/hints-overview.md b/versioned_docs/version-4.x/query-acceleration/hints/hints-overview.md index fcf8551beda34..a5c6575c7a90d 100644 --- a/versioned_docs/version-4.x/query-acceleration/hints/hints-overview.md +++ b/versioned_docs/version-4.x/query-acceleration/hints/hints-overview.md @@ -1,11 +1,11 @@ ---- -{ - "title": "Overview of Hints", - "language": "en", - "description": "Database Hints are query optimization techniques used to guide the database query optimizer on how to generate a specific plan. By providing Hints," -} ---- - +--- +{ + "title": "Overview of Hints", + "language": "en", + "description": "Database Hints are query optimization techniques used to guide the database query optimizer on how to generate a specific plan. By providing Hints," +} +--- + Database Hints are query optimization techniques used to guide the database query optimizer on how to generate a specific plan. By providing Hints, users can fine-tune the default behavior of the query optimizer in hopes of achieving better performance or meeting specific requirements. :::caution Note Currently, Doris possesses excellent out-of-the-box capabilities. In most scenarios, Doris adaptively optimizes performance across various situations without requiring users to manually control hints for business tuning. The content presented in this chapter is primarily intended for professional tuning personnel. Business users can have a brief understanding of it. @@ -18,6 +18,7 @@ Doris currently supports several types of hints, including leading hint, ordered - [Leading Hint](leading-hint.md):Specifies the join order according to the order provided in the leading hint. - [Ordered Hint](leading-hint.md):A specific type of leading hint that specifies the join order as the original text sequence. - [Distribute Hint](distribute-hint.md):Specifies the data distribution method for joins as either shuffle or broadcast. +- [Skew Hint](skew-hint.md):Used to provide skewed information about data distribution, guiding the optimizer to adjust the plan. ## Hint Example Imagine a table with a large amount of data. In certain specific cases, you may know that the join order of the tables can affect query performance. In such situations, the Leading Hint allows you to specify the table join order you want the optimizer to follow. diff --git a/versioned_docs/version-4.x/query-acceleration/hints/skew-hint.md b/versioned_docs/version-4.x/query-acceleration/hints/skew-hint.md new file mode 100644 index 0000000000000..b072f880c1940 --- /dev/null +++ b/versioned_docs/version-4.x/query-acceleration/hints/skew-hint.md @@ -0,0 +1,316 @@ +--- +{ + "title": "Skew Hint", + "language": "en", + "description": "Skew Hint is used to mitigate data skew in query execution." +} +--- + +## Overview + +Skew Hint is used to mitigate data skew in query execution. + +## Join Skew Hint + +### Overview + +`SaltJoin` is used to mitigate data skew in join scenarios. When join keys contain known hot values, the optimizer introduces a salt column to spread hot-key rows across multiple parallel instances, preventing a single instance from becoming the bottleneck. + +The primary goal of this rewrite is to reduce local overload risk caused by hot keys in `Shuffle Join` scenarios and improve overall execution stability. + +### Applicable Scenarios + +1. Obvious one-sided skew: one side of the join has highly concentrated hot keys. + +2. Known skewed values: you can explicitly provide skewed value lists through hints. + +3. `Shuffle Join` is required: the other table is too large for `Broadcast Join`. + +### Supported Join Types + +- `INNER JOIN` +- `LEFT JOIN` +- `RIGHT JOIN` + +### Usage + +#### Method 1: comment hint + +```sql +SELECT /*+ leading(tl shuffle[skew(tl.a(1,2))] tr) */ * +FROM tl +INNER JOIN tr ON tl.a = tr.a; +``` + +#### Method 2: join hint syntax + +```sql +SELECT * +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a; +``` + +Parameter notes: + +- `tl`: alias of the left table. +- `tr`: alias of the right table. +- `tl.a`: skewed column. +- `(1,2)`: list of known skewed values. + +Example: + +Create test tables and insert data: + +```sql +-- Create left table tl +CREATE TABLE IF NOT EXISTS tl ( + id INT, + a INT, + name STRING, + value DOUBLE +) USING parquet; + +-- Create right table tr +CREATE TABLE IF NOT EXISTS tr ( + id INT, + a INT, + description STRING, + amount DOUBLE +) USING parquet; + +-- Insert left table data (simulated skew) +INSERT INTO tl VALUES +(1, 1, 'name_1', 100.0), +(2, 1, 'name_2', 200.0), +(3, 1, 'name_3', 300.0), +(4, 1, 'name_4', 400.0), +(5, 2, 'name_5', 500.0), +(6, 2, 'name_6', 600.0), +(7, 2, 'name_7', 700.0), +(8, 3, 'name_8', 800.0), +(9, 4, 'name_9', 900.0), +(10, 5, 'name_10', 1000.0); + +-- Insert right table data +INSERT INTO tr VALUES +(1, 1, 'desc_1', 150.0), +(2, 1, 'desc_2', 250.0), +(3, 2, 'desc_3', 350.0), +(4, 2, 'desc_4', 450.0), +(5, 3, 'desc_5', 550.0), +(6, 4, 'desc_6', 650.0), +(7, 5, 'desc_7', 750.0), +(8, 1, 'desc_8', 850.0), +(9, 2, 'desc_9', 950.0); +``` + +Use salt join to optimize queries: + +Example 1: optimize inner join + +```sql +-- Comment hint syntax +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +INNER JOIN tr ON tl.a = tr.a +WHERE tl.value > 300.0; + +-- Join hint syntax +SELECT + tl.id as tl_id, + tl.name, + tr.description, + tl.value + tr.amount as total +FROM tl +JOIN[shuffle[skew(tl.a(1,2))]] tr ON tl.a = tr.a +WHERE tl.value > 300.0; +``` + +Example 2: optimize left join + +```sql +-- Mitigate skew on the left table in a left join +SELECT /*+leading(tl shuffle[skew(tl.a(1,2))] tr)*/ + tl.id, + tl.a, + tl.name, + COALESCE(tr.description, 'No Match') as description +FROM tl +LEFT JOIN tr ON tl.a = tr.a +ORDER BY tl.id; +``` + +Example 3: optimize right join + +```sql +-- Mitigate skew on the right table in a right join +SELECT /*+leading(tl shuffle[skew(tr.a(1,2))] tr)*/ + tr.id, + tr.a, + tr.description, + COALESCE(tl.name, 'No Match') as name +FROM tl +RIGHT JOIN tr ON tl.a = tr.a +WHERE tr.amount > 500.0; +``` + +### Optimization Principle + +The core idea is a salting rewrite for hot keys. + +After skew values are specified via `skew(...)`, the optimizer introduces a salt column on the skewed side and rewrites the join condition from `key` to `(key, salt)`. This spreads hot-key rows across parallel instances instead of concentrating them in a single worker. + +To keep join semantics correct, the other side is expanded by the same salt buckets for the corresponding skewed keys, so rows can still match on `(key, salt)`. + +A simplified flow: + +1. Identify and mark hot values. + +2. Add salt on the skewed side to split hot rows. + +3. Expand matching rows on the other side by salt buckets, then join. + +This strategy works best for one-sided skew and can significantly reduce hotspot pressure while improving parallelism and execution stability. + +### Limitations + +`SaltJoin` can only mitigate one-sided hotspots and cannot fully eliminate two-sided skew on the same key. + +With left-side skew as an example, the rewrite randomly salts hot keys on the left side and expands rows on the right side by salt value. The join condition changes from `key` to `(key, salt)`, so the left-side hotspot is distributed. + +However, the right side does not reduce hotspot data; it is duplicated across salt partitions for matching. Therefore, when both sides are highly skewed on the same key, this rewrite can reduce pressure on one side but cannot completely fix hotspots on the other side. + +For example, if both left and right tables each contain 100 rows with `key=1` and the bucket count is 100, the left-side rows are distributed across 100 buckets, while right-side rows are expanded so each bucket still contains those 100 rows. Left-side pressure decreases, but right-side skew remains significant. + +## AGG Skew Hint + +### Overview + +`Count Distinct Skew Rewrite` is used to mitigate NDV skew in `DISTINCT` aggregations. + +A typical case is: `GROUP BY a` has a small number of groups, but one hot group (for example, `a=1`) has an extremely large `DISTINCT b`, causing a single instance to hold a very large dedup hash table and leading to memory pressure and long tails. + +This rewrite uses salting buckets plus multi-stage aggregation to split distinct processing inside hot groups and reduce per-instance load. + +### Applicable Scenarios + +1. Obvious NDV skew in `DISTINCT` aggregations: a few groups have abnormally high cardinality. + +2. Long-tail latency, high memory watermark, or OOM risk with normal multi-stage distinct aggregation. + +3. Query is `GROUP BY`-centric and the target distinct argument can be explicitly marked with `[skew]`. + +### Usage + +```sql +SELECT a, COUNT(DISTINCT [skew] b) +FROM t +GROUP BY a; +``` + +### Supported Functions + +Currently, AGG skew rewrite supports the following aggregate functions: + +- `COUNT` +- `SUM` +- `SUM0` +- `GROUP_CONCAT` + +Only the functions above support AGG skew rewrite. Other aggregate functions fall back to the regular plan. + +### Optimization Principle + +For `SELECT a, COUNT(DISTINCT [skew] b) FROM t GROUP BY a`, the flow is: + +1. Apply local deduplication first to reduce raw data volume. + +2. Compute a bucket column for the distinct argument (for example `saltExpr = xxhash_32(b) % bucket_num`). + +3. Distribute by `(a, saltExpr)` and run `multi_distinct_count`. + +4. Aggregate by `a` again and merge bucket results to produce final `COUNT(DISTINCT b)`. + +The key benefit is that hot groups are no longer handled by one large dedup hash structure; they are split into buckets and processed in parallel. + +### Limitations + +`Count Distinct Skew Rewrite` is condition-based. If conditions are not met, the optimizer falls back to the normal aggregation plan. Common limitations include: + +1. `GROUP BY` is required (pure global aggregation does not trigger it). + +2. The target must be a single-argument `DISTINCT` aggregation and marked with `[skew]`. + +3. If the same level has more complex multi-aggregation combinations, the optimizer may skip this rewrite. + +4. If the distinct argument is already included in `GROUP BY`, this rewrite usually provides no benefit and will not trigger. + +### Recommendations + +1. Prioritize `[skew]` for `DISTINCT` aggregations with clear hotspots. + +2. Tune `skew_rewrite_agg_bucket_num` based on data scale to avoid too few buckets (insufficient split) or too many buckets (higher scheduling and merge overhead). + +3. Compare `EXPLAIN`/`PROFILE` before and after optimization to verify reductions in long-tail latency and memory peak. + +## Window Skew Hint + +### Overview + +`Window Skew Rewrite` mitigates sort long-tail issues in window functions when `PARTITION BY` keys are skewed. + +When some partition keys (such as user ID or organization ID) are highly concentrated, conventional execution accumulates large sort and window workloads on a few instances, and the slowest instance dominates total latency. + +### Applicable Scenarios + +1. Obvious hotspots on window `PARTITION BY` keys. + +2. Window queries with `ORDER BY` where sorting is the main bottleneck. + +3. Multiple window expressions in one SQL statement, sharing all or part of the partition keys. + +### Usage + +Mark `[skew]` directly in the `PARTITION BY` clause: + +```sql +SELECT + SUM(a) OVER( + PARTITION BY [skew] b + ORDER BY d + ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING + ) AS w1 +FROM test_skew_window; +``` + +### Optimization Principle + +The core idea is to split heavy sorting into two stages: + +1. Perform local sort upstream. + +2. Shuffle by `PARTITION BY` keys. + +3. Run merge sort downstream, then compute window functions. + +Compared with "shuffle then full sort", this approach is usually more stable under skew: hotspot partitions still need processing, but sorting shifts from full re-sorting to merging pre-sorted streams. + +### Limitations + +1. `[skew]` is a partition-key-level hint and mainly targets `PARTITION BY` skew. + +2. This optimization focuses on sorting overhead and does not change window semantics; extremely large single partitions can still cause long tails. + +3. Within the same partition-key group, only lower window nodes that can shuffle apply this strategy; upper nodes reuse existing distribution and order. + +4. Without `PARTITION BY`, window execution cannot use partition-level parallelism to mitigate skew. + +### Recommendations + +1. Prioritize `[skew]` on partition keys with obvious hotspots. + +2. Use `PROFILE` to observe sort-node time, skew metrics, and long-tail instance changes. diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json index 49c6cadff6ffe..23d8aa5753022 100644 --- a/versioned_sidebars/version-4.x-sidebars.json +++ b/versioned_sidebars/version-4.x-sidebars.json @@ -677,7 +677,8 @@ "items": [ "query-acceleration/hints/hints-overview", "query-acceleration/hints/leading-hint", - "query-acceleration/hints/distribute-hint" + "query-acceleration/hints/distribute-hint", + "query-acceleration/hints/skew-hint" ] }, {