[SPARK-56034][SQL] Push down Join through Union when the right side is broadcastable #54865

LuciferYang wants to merge 6 commits into apache:master
Conversation
```scala
    .createWithDefault(true)

  val PUSH_DOWN_JOIN_THROUGH_UNION_ENABLED =
    buildConf("spark.sql.optimizer.pushDownJoinThroughUnion.enabled")
```
If it needs to be set to false by default, please let me know.
+1 for true by default because this configuration is only a safe-guard for any future regression.
According to the code, we can use spark.sql.optimizer.excludedRules instead of this, right? Is there any difference?
Good point @dongjoon-hyun. You're right — spark.sql.optimizer.excludedRules already provides a general mechanism to disable any optimizer rule, and adding a dedicated config for each rule would lead to config proliferation. I'll remove the dedicated config spark.sql.optimizer.pushDownJoinThroughUnion.enabled and rely on excludedRules instead. Thanks for the suggestion!
cc @yaooqinn and @peter-toth, too.
I might be missing something but I don't get this part:
Why does Spark need to shuffle the union result if the right side is small enough to be broadcast (i.e. the original join was a broadcast join)? Is there a TPCDS plan where an exchange is removed by this PR?
```
* Sort (59)
+- Exchange (58)
   +- * Project (57)
      +- * SortMergeJoin Inner (56)
```
This was BroadcastHashJoin before this PR. Why do we have SortMergeJoin now?
Great catch @peter-toth. I investigated the root cause and it turns out to be a statistics estimation degradation chain triggered by a pre-existing gap in UnionEstimation.
Root cause: UnionEstimation only propagates min/max and nullCount through Union — it does not propagate distinctCount. When PushDownJoinThroughUnion transforms the plan from Aggregate(Join(Union, date_dim)) to Aggregate(Union(Join, Join)), the d_week_seq column loses its distinctCount after passing through the new Union node, which causes AggregateEstimation CBO to fail (since hasCountStats requires both distinctCount and nullCount), falling back to SizeInBytesOnlyStatsPlanVisitor with a vastly inflated estimate.
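To make the gap concrete, here is a minimal sketch (toy types, not Spark's actual `ColumnStat`/`UnionEstimation` API) of why merging one column's stats across Union children can keep min/max and nullCount but must drop distinctCount:

```python
# Toy model of per-column statistics; field names are illustrative,
# not Spark's actual API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ColStats:
    min: Optional[int]
    max: Optional[int]
    null_count: Optional[int]
    distinct_count: Optional[int]

def merge_for_union(a: ColStats, b: ColStats) -> ColStats:
    """Merge one column's stats across two Union children.

    min/max/null_count combine soundly, but the union's distinct count
    is not derivable from the children's counts alone (their overlap is
    unknown), so a simple estimator has to drop it.
    """
    both = lambda x, y, f: f(x, y) if x is not None and y is not None else None
    return ColStats(
        min=both(a.min, b.min, min),
        max=both(a.max, b.max, max),
        null_count=both(a.null_count, b.null_count, lambda x, y: x + y),
        distinct_count=None,  # true value lies between max(d1, d2) and d1 + d2
    )
```

Once `distinct_count` is `None`, any downstream estimator that requires it (like the `hasCountStats` check described above) fails and falls back to size-only estimation.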
I wrote a simplified reproduction test using TPC-DS sf100 stats to measure the actual impact:
| Metric | BEFORE (`Agg(Join(Union, dd))`) | AFTER (`Agg(Union(Join, Join))`) |
|---|---|---|
| `d_week_seq` distinctCount | `Some(10010)` | `None` (lost by Union) |
| `d_week_seq` hasCountStats | `true` | `false` |
| Aggregate rowCount | `Some(10010)` | `None` (CBO failed) |
| Aggregate sizeInBytes | 195KB | 4.1GB (~21,000x inflation) |
This inflated estimate (4.1GB) far exceeds the broadcast threshold (default 10MB), causing the top-level self-join (year-over-year comparison) to fall from BroadcastHashJoin to SortMergeJoin.
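The strategy flip can be illustrated with a toy size-based decision (the function name is illustrative; the default mirrors `spark.sql.autoBroadcastJoinThreshold`, 10MB):

```python
# Toy join-strategy pick based purely on one side's estimated size,
# mirroring spark.sql.autoBroadcastJoinThreshold (default 10MB).
def pick_join(estimated_size_in_bytes: int,
              threshold_bytes: int = 10 * 1024 * 1024) -> str:
    if estimated_size_in_bytes <= threshold_bytes:
        return "BroadcastHashJoin"
    return "SortMergeJoin"
```

With the accurate ~195KB estimate the self-join stays a BroadcastHashJoin; with the inflated ~4.1GB estimate it degrades to SortMergeJoin.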
Notably, I think this UnionEstimation gap is pre-existing — any GROUP BY ... FROM (... UNION ALL ...) pattern with CBO column stats will lose distinctCount through the Union. I can try to create a case that reproduces this issue without this PR and attempt to fix it separately first.
@peter-toth I have submitted a pull request at #54883 in an attempt to optimize the issues mentioned earlier.
I performed a rebase, and it seems that the SMJ has reverted back to a BHJ.
The change makes sense to me, but let me check this PR thoroughly Monday.
```scala
) match {
  case Join(_, deduped, _, _, _) => deduped
  case other =>
    throw SparkException.internalError(
```
Do any other optimizations guard against bug-like cases with internal errors like this?
Thanks @yaooqinn. Yes, SparkException.internalError is used in several optimizer rules as a defensive guard for "should-never-happen" plan shapes, for example:
- `NestedColumnAliasing`: "Unreasonable plan after optimization: $other"
- `PushExtraPredicateThroughJoin` / `Optimizer`: "Unexpected join type: $other"
- `DecorrelateInnerQuery`: "Unexpected domain join type $o"
- `subquery.scala`: "Unexpected plan when optimizing one row relation subquery: $o"
The dedupRight method here follows the same pattern — it guards against the (theoretically impossible) case where DeduplicateRelations changes the Join plan shape.
That said, InlineCTE uses the same "fake self-join + DeduplicateRelations" approach and simply calls .children(1) directly without any defensive check. I can align with InlineCTE and remove the explicit throw if you think that's cleaner. Alternatively, I could keep the pattern match but return the original plan unchanged in the fallback case (skipping the dedup rather than failing). Which approach would you prefer?
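The "skip rather than fail" alternative could look roughly like this (toy plan types, not the actual rule code):

```python
# Toy logical-plan nodes; illustrative only, not Spark's LogicalPlan.
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    name: str

@dataclass(frozen=True)
class Join:
    left: object
    right: object

def dedup_right_lenient(original, after_dedup):
    """Extract the deduplicated right child from the fake self-join.

    If the dedup pass returned an unexpected shape, fall back to the
    original plan (skipping the dedup) instead of raising an internal error.
    """
    if isinstance(after_dedup, Join):
        return after_dedup.right
    return original
```

The trade-off is silent recovery versus failing loudly on a plan shape that should be impossible.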
```scala
    hint: JoinHint): Boolean = {
  canBroadcastBySize(right, conf) ||
    hint.rightHint.exists(_.strategy.contains(BROADCAST)) ||
    (joinType == Inner && hint.leftHint.exists(_.strategy.contains(BROADCAST)))
```
> the right side is broadcastable

Is this out-of-scope?
Good catch @yaooqinn. You're right — this leftHint check on line 111 is problematic and should be removed.
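With the leftHint branch removed, a toy version of the check (simplified types, not Spark's actual `JoinHint`/`HintInfo` signatures) reduces to testing the right side only:

```python
# Toy stand-ins for Spark's JoinHint/HintInfo; illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class HintInfo:
    broadcast: bool = False

@dataclass
class JoinHint:
    left_hint: Optional[HintInfo] = None
    right_hint: Optional[HintInfo] = None

def is_right_broadcastable(right_size_in_bytes: int, hint: JoinHint,
                           threshold_bytes: int = 10 * 1024 * 1024) -> bool:
    # Right side is broadcastable if its size estimate is under the
    # threshold or an explicit BROADCAST hint targets the right side;
    # a hint on the left side is deliberately ignored.
    return (right_size_in_bytes <= threshold_bytes
            or (hint.right_hint is not None and hint.right_hint.broadcast))
```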
@dongjoon-hyun @yaooqinn @peter-toth Thank you for your comments. I will carefully review the issues mentioned tomorrow.
Thanks @peter-toth, you are absolutely right. The PR description was incorrect — since the right side is already broadcastable, the original join is a BroadcastHashJoin.
```scala
if conf.getConf(SQLConf.PUSH_DOWN_JOIN_THROUGH_UNION_ENABLED) &&
    (joinType == Inner || joinType == LeftOuter) &&
    joinCond.isDefined &&
    isBroadcastable(joinType, right, hint) &&
```
In PushDownLeftSemiAntiJoin we use canPlanAsBroadcastHashJoin(), can you please check if we could use that here as well?
I think it's feasible. Let me give it a try.
What changes were proposed in this pull request?
This PR adds a new optimizer rule `PushDownJoinThroughUnion` that pushes a Join below a Union, transforming `Join(Union(...), right)` into `Union(Join(..., right), ...)`, when the right side of the join is small enough to broadcast (by size statistics or explicit `BROADCAST` hints). The rule applies to Inner and LeftOuter joins. It is placed after the "Early Filter and Projection Push-Down" batch in the optimizer to ensure accurate data source statistics are available. The rule can be disabled via `spark.sql.optimizer.excludedRules`.

Key implementation details:

- Uses the "fake self-join + `DeduplicateRelations`" pattern (same as `InlineCTE`) to create independent copies of the right subtree with fresh `ExprId`s for each Union branch.
- Note that `DeduplicateRelations` may not correctly handle correlated references when cloning.

Why are the changes needed?
This is a common pattern in TPC-DS queries (e.g., q2, q5, q54, q5a) and real-world analytics workloads: a large fact table is formed by `UNION ALL` of multiple sources and then joined with a small dimension table.

Since the rule only fires when the right side is already broadcastable, the total probe work and output volume are the same before and after the transformation — the same rows are probed and the same rows are produced, just at a different position in the plan tree. The broadcast exchange is materialized once and shared across Union branches via `ReusedExchange`.

The benefit is structural: each Union branch becomes a self-contained subplan, enabling AQE to make independent per-branch adaptive decisions (e.g., coalescing partitions, custom shuffle readers) based on each branch's actual runtime data characteristics.
Does this PR introduce any user-facing change?
Query plans for affected patterns (e.g., TPC-DS q2, q5, q54, q5a) will change — the Join is pushed below the Union, and the broadcast exchange for the right side is shared across Union branches via `ReusedExchange`.

How was this patch tested?
- `PushDownJoinThroughUnionSuite` in `sql/catalyst` (13 test cases): verifies plan transformation for Inner/LeftOuter joins, attribute rewriting, ExprId uniqueness across Union branches, negative cases (unsupported join types, no condition, Union on right side, right side too large), and complex right-side subtrees (Filter+Project, Generate/Explode, SubqueryAlias, Aggregate).
- `PushDownJoinThroughUnionSuite` in `sql/core` (6 test cases): end-to-end correctness tests including 2-way and 3-way UNION ALL with broadcast join, LeftOuter join, optimization enabled vs excluded comparison, column pruning, and predicate push-down interaction.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code