[SPARK-56046][SQL] Typed SPJ partition key Reducers #54884
peter-toth wants to merge 11 commits into apache:master
```diff
 object YearsFunction extends ScalarFunction[Int] with ReducibleFunction[Int, Int] {
   override def inputTypes(): Array[DataType] = Array(TimestampType)
-  override def resultType(): DataType = LongType
+  override def resultType(): DataType = IntegerType
```
I changed the test `years` transform to return `IntegerType` and the test `days` transform to return `DateType` logical types, because those two differ but have the same `PhysicalIntegerType` physical type.
I also made `days` reducible to `years`, which is very similar to what Iceberg can do with hours and days.
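To make the reducibility concrete, here is a minimal self-contained sketch (the `Reducer` trait and `DaysToYearsReducer` below are simplified stand-ins, not Spark's actual `org.apache.spark.sql.connector.catalog.functions.Reducer` API):

```scala
import java.time.LocalDate

// Simplified stand-in for Spark's connector Reducer interface.
trait Reducer[I, O] {
  def reduce(key: I): O
}

// Hypothetical reducer: a `days` partition key (days since 1970-01-01, the
// physical Int representation of DateType) reduces to a `years` key (years
// since 1970, an IntegerType key), analogous to reducing `hours` to `days`.
object DaysToYearsReducer extends Reducer[Int, Int] {
  override def reduce(days: Int): Int =
    LocalDate.ofEpochDay(days.toLong).getYear - 1970
}
```

Two tables partitioned by `days` and `years` can then be storage-partition joined by reducing the finer `days` keys to `years` keys on one side.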
cc @szehon-ho, @dongjoon-hyun
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Reducer.java
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
```scala
      .createWithDefault(false)

  val V2_BUCKETING_ALLOW_INCOMPATIBLE_TRANSFORM_TYPES =
    buildConf("spark.sql.sources.v2.bucketing.allowIncompatibleTransformTypes.enabled")
```
Do you think we can set this configuration to false for some cases in the future, @peter-toth? I'm a little confused about when it makes sense to disallow incompatible transform types.
This is a good question and I was thinking about it too. I feel we should not compare different logical types due to their different semantic meanings, but seemingly this is what we currently do in some cases, so we should probably keep the behavior for now. I think in a future Spark release we can change this config to make sure a comparison makes sense.
Yeah, I'm also thinking that if there is some dangerous discrepancy now, it is worth a behavior change to fix it.
The only consumer that I know of is Iceberg, which has an hoursToDay reducer that changes type, and a bucket reducer (which doesn't change type). Iceberg will need to recompile against Spark 4.2 anyway, so it's an opportunity for us to fix it there.
WDYT (as regards the Spark release policy)?
Yeah, very likely Iceberg is the only project that implemented reducers.
If we are ok with fixing the issue in Iceberg, then probably we don't need the latest commit, but we can keep `resultType()` in `Reducer`, remove its default value, and drop this config.
I'm actively testing Spark 4.2.0 integration in Iceberg. The issue was only in 4.2.0-preview3 and I can work on the Iceberg changes for the next preview release. +1 to drop this config.
In effect, every Iceberg release has jars for Spark 4.0, 4.1, 4.2, etc.
So in effect, Iceberg 1.11 (or Iceberg 1.12) with Spark 4.2 will be a new jar, and it can start fresh (the affected Reducer will just implement the new interface). As per the Iceberg release policy, old Iceberg branches (i.e., 1.10 / 1.09 / etc.) will never have Spark 4.2 support. So I still feel this is a bit overkill here, but will defer if you feel strongly.
> keep resultType() in Reducer, remove its default value and drop this config.
I prefer this way; the `Reducer` is marked as `@Evolving`, so I suppose such a change is acceptable.
From the connector developer's perspective, two interfaces introduce an additional understanding cost (the reducer API is already a relatively complex part compared to other DSv2 APIs). In addition, adding a method to the interface lets developers write an impl that compiles against Spark 4.2 and also keeps binary compatibility with Spark 4.1.
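A small sketch of that compatibility argument (toy traits, not the real `Reducer` interface; `resultType()` returns a `String` here as a stand-in for `DataType`): an implementation that overrides the newly added method still satisfies the old interface, so one class can serve both API versions.

```scala
// "Old" interface shape: reduce() only.
trait ReducerV1[I, O] {
  def reduce(key: I): O
}

// "New" interface shape: adds resultType() with no default value, so
// connector implementations are forced to declare the reduced key type.
trait ReducerV2[I, O] extends ReducerV1[I, O] {
  def resultType(): String // stand-in for Spark's DataType
}

// An Iceberg-like hours-to-days reducer implementing the new shape; it can
// still be used anywhere only ReducerV1 is expected.
object HoursToDaysReducer extends ReducerV2[Int, Int] {
  override def reduce(hours: Int): Int = Math.floorDiv(hours, 24)
  override def resultType(): String = "DateType"
}
```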
In that case, I'm fine either way. Feel free to choose. :)
@manuzhang as you are working on the Spark 4.2 snapshot on the Iceberg side, let's make sure the existing Reducer implements the new method.
edit: sorry, just saw your comment above
cc @aokolnychyi, @cloud-fan, @gengliangwang, too.
...ore/src/test/scala/org/apache/spark/sql/connector/catalog/functions/transformFunctions.scala
Thank you for catching this and providing a fix promptly, @peter-toth.
cc @szehon-ho as well
I'm taking a look, thanks
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Reducer.java
The previous behavior of always using the left side key type is indeed problematic, but the new rule looks too strict. Is it possible to follow the behavior of join key type mismatch handling? When a join has an equi-join condition with mismatched key types, the keys are coerced to a common type instead of failing.
I think this is a slightly different issue from type coercion.
@peter-toth, this sounds reasonable; maybe we should emphasize that in the javadocs?
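For reference, the join key handling mentioned above works roughly like this toy sketch (illustrative only; Spark's real logic lives in its TypeCoercion rules): mismatched join key types are resolved by casting both sides to a common wider type when one exists.

```scala
// Toy logical types and a toy "find common wider type" for join keys.
sealed trait ToyType
case object IntT  extends ToyType
case object LongT extends ToyType
case object DateT extends ToyType

def widerType(l: ToyType, r: ToyType): Option[ToyType] = (l, r) match {
  case (a, b) if a == b              => Some(a)
  case (IntT, LongT) | (LongT, IntT) => Some(LongT)
  // In this sketch there is no implicit coercion between DateT and IntT,
  // which mirrors why reduced-key types cannot simply be coerced.
  case _                             => None
}
```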
Fixed in 595d59e.
+1 from my side.
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/functions/Reducer.java
```scala
        !conf.v2BucketingAllowIncompatibleTransformTypes ||
          leftReducedDataTypes.map(PhysicalDataType(_)) !=
            rightReducedDataTypes.map(PhysicalDataType(_)))) {
        throw new SparkException("Storage-partition join partition transforms produced " +
```
Is there any error code/class? Again, I feel it's overkill, but maybe we should do it if we keep this approach. Also maybe we can use `Reducer.displayName`.
...ore/src/test/scala/org/apache/spark/sql/connector/catalog/functions/transformFunctions.scala
```scala
   * Returns the reduced keys and their data types.
   */
  def reduceKeys(reducers: Seq[Option[Reducer[_, _]]]): Seq[InternalRowComparableWrapper] =
    KeyedPartitioning.reduceKeys(partitionKeys, expressionDataTypes, reducers).distinct
```
This `.distinct` was moved to `mergeAndDedupPartitions()`.
I changed the implementation in b82df20 as we discussed in this thread: #54884 (comment), and updated the PR description.
szehon-ho
left a comment
LGTM, thanks! Left one comment, but feel free to merge as it's quite minor.
```scala
      Seq(
        s"testcat.ns.$items i JOIN testcat.ns.$purchases p ON p.time = i.arrive_time",
        s"testcat.ns.$purchases p JOIN testcat.ns.$items i ON i.arrive_time = p.time"
      ).foreach { joinSting =>
```
gengliangwang
left a comment
Clean PR — the architecture is well-designed and the type information flows correctly through all layers (Reducer → KeyedPartitioning → EnsureRequirements/GroupPartitionsExec).
```scala
  override def canonicalName(): String = name()
}

// This `days` function reduces `DateType` partitions keys to `IntegerType` partitions keys when
```
Typo: "partitions keys" → "partition keys"
```diff
-// This `days` function reduces `DateType` partitions keys to `IntegerType` partitions keys when
+// This `days` function reduces `DateType` partition keys to `IntegerType` partition keys when
```
```scala
  * @param leftPartitionKeys left side partition keys
  * @param rightPartitionKeys right side partition keys
  * @param joinType join type for optional partition filtering
  * @keyOrdering ordering to sort partition keys
```
Pre-existing, but since this doc block was updated: @keyOrdering is not a valid Scaladoc tag.
```diff
-  * @keyOrdering ordering to sort partition keys
+  * @param keyOrdering ordering to sort partition keys
```
What changes were proposed in this pull request?
This PR adds a new method to SPJ partition key `Reducer`s to return the type of a reduced partition key.

Why are the changes needed?
After the SPJ refactor (KeyGroupedPartitioning and Storage Partition Join #54330), some Iceberg SPJ tests that join an `hours` transform partitioned table with a `days` transform partitioned table started to fail. This is because after the refactor the keys of a `KeyedPartitioning` are `InternalRowComparableWrapper`s, which include the type of the key, and when the partition keys are reduced, the type of the reduced keys is inherited from their original type.
This means that when `hours` transformed keys are reduced to days, the keys actually keep their `IntegerType` type, while the `days` transformed keys have `DateType` type in Iceberg. This type difference causes the left and right side `InternalRowComparableWrapper`s to not be considered equal despite their raw `InternalRow` key data being equal.

Before the refactor, the type of (possibly reduced) partition keys was not stored in the partitioning. When the left and right side raw keys were compared in `EnsureRequirements`, a common comparator was initialized with the type of the left side keys.
So in the Iceberg SPJ tests the `IntegerType` keys were forced to be interpreted as `DateType`, or the `DateType` keys were forced to be interpreted as `IntegerType`, depending on the join order of the tables.

The reason why this did not cause any issues is that the `PhysicalDataType` of both the `DateType` and `IntegerType` logical types is `PhysicalIntegerType`.
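The mismatch can be illustrated with a self-contained sketch (toy classes, not Spark's actual `InternalRowComparableWrapper` or `PhysicalDataType`): the wrappers compare unequal because the logical types differ, even though the raw values and the physical types match.

```scala
// Toy logical types that share one physical representation.
sealed trait LogicalType { def physical: String }
case object IntType  extends LogicalType { val physical = "PhysicalIntegerType" }
case object DateType extends LogicalType { val physical = "PhysicalIntegerType" }

// Toy wrapper: equality includes the logical type, like the post-refactor
// InternalRowComparableWrapper includes the key's data type.
final case class KeyWrapper(raw: Int, dataType: LogicalType)

val left  = KeyWrapper(19000, IntType)  // e.g. a reduced `hours` key
val right = KeyWrapper(19000, DateType) // e.g. a `days` key

val sameRaw      = left.raw == right.raw                             // true
val samePhysical = left.dataType.physical == right.dataType.physical // true
val equalKeys    = left == right                                     // false
```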
This PR introduces a new `resultType()` method on `Reducer` to return the correct type of the reduced keys, properly compares the left and right side reduced key types, and throws an error when they are not the same.

Does this PR introduce any user-facing change?
Yes, the reduced key types are now properly compared and incompatibilities are reported to users.
How was this patch tested?
Added new UTs.
Was this patch authored or co-authored using generative AI tooling?
No.