[SPARK-56046][SQL] Typed SPJ partition key Reducers
#54884
Open
peter-toth wants to merge 11 commits into apache:master from peter-toth:SPARK-56046-typed-spj-reducers
+252 −44
Commits (11, all by peter-toth):

- fa4bce7 [SPARK-56046][SQL] Typed SPJ partition key reducers
- 494b923 fix config
- e31b361 fix expected ordering type of years transform
- a00c069 address review comments
- a43f4b4 Merge branch 'master' into SPARK-56046-typed-spj-reducers
- c20b301 Extract `TypedReducer` from `Reducer`
- 595d59e address review comments
- b82df20 simplify solution
- 39f7c60 fix typo
- fbf2630 extract error class
- 7253d48 Merge branch 'master' into SPARK-56046-typed-spj-reducers
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
EnsureRequirements.scala:

```diff
@@ -27,6 +27,7 @@ import org.apache.spark.sql.catalyst.plans.physical._
 import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.catalyst.util.InternalRowComparableWrapper
 import org.apache.spark.sql.connector.catalog.functions.Reducer
+import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.execution._
 import org.apache.spark.sql.execution.datasources.v2.GroupPartitionsExec
 import org.apache.spark.sql.execution.joins.{ShuffledHashJoinExec, SortMergeJoinExec}
```
```diff
@@ -509,16 +510,24 @@ case class EnsureRequirements(
       // in case of compatible but not identical partition expressions, we apply 'reduce'
       // transforms to group one side's partitions as well as the common partition values
       val leftReducers = leftSpec.reducers(rightSpec)
-      val leftReducedKeys =
-        leftReducers.fold(leftPartitioning.partitionKeys)(leftPartitioning.reduceKeys)
       val rightReducers = rightSpec.reducers(leftSpec)
-      val rightReducedKeys =
-        rightReducers.fold(rightPartitioning.partitionKeys)(rightPartitioning.reduceKeys)
+      val (leftReducedDataTypes, leftReducedKeys) = leftReducers.fold(
+        (leftPartitioning.expressionDataTypes, leftPartitioning.partitionKeys)
+      )(leftPartitioning.reduceKeys)
+      val (rightReducedDataTypes, rightReducedKeys) = rightReducers.fold(
+        (rightPartitioning.expressionDataTypes, rightPartitioning.partitionKeys)
+      )(rightPartitioning.reduceKeys)
+      if (leftReducedDataTypes != rightReducedDataTypes) {
+        throw QueryExecutionErrors.storagePartitionJoinIncompatibleReducedTypesError(
+          leftReducers = leftReducers,
+          leftReducedDataTypes = leftReducedDataTypes,
+          rightReducers = rightReducers,
+          rightReducedDataTypes = rightReducedDataTypes)
+      }

       // merge values on both sides
-      var mergedPartitionKeys =
-        mergePartitions(leftReducedKeys, rightReducedKeys, joinType, leftPartitioning.keyOrdering)
-          .map((_, 1))
+      var mergedPartitionKeys = mergeAndDedupPartitions(leftReducedKeys, rightReducedKeys,
+        joinType, leftPartitioning.keyOrdering).map((_, 1))

       logInfo(log"After merging, there are " +
         log"${MDC(LogKeys.NUM_PARTITIONS, mergedPartitionKeys.size)} partitions")
```
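The type check above depends on each reducer exposing the data type of the keys it produces (commit c20b301, "Extract `TypedReducer` from `Reducer`"). The exact interface is not part of this diff excerpt, so the following is only a minimal sketch of the idea, with `TypedReducer` and `reduceType` as assumed names layered on the existing `Reducer` contract:

```scala
import org.apache.spark.sql.types.{DataType, IntegerType}

// Existing SPJ contract (sketched): maps a partition key from the finer
// transform onto the coarser key space shared by both join sides.
trait Reducer[I, O] {
  def reduce(arg: I): O
}

// Hypothetical typed variant: also reports the reduced key's data type so
// EnsureRequirements can compare the reduced types across both sides.
trait TypedReducer[I, O] extends Reducer[I, O] {
  def reduceType: DataType
}

// Example: reduce days-since-epoch keys to years-since-epoch keys.
object DaysToYearsReducer extends TypedReducer[Int, Int] {
  override def reduce(days: Int): Int =
    java.time.LocalDate.ofEpochDay(days.toLong).getYear - 1970
  override val reduceType: DataType = IntegerType
}
```

Under such a contract, `reduceKeys` can return both the reduced keys and their data types, matching the tuple shape the `fold` above works with.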
```diff
@@ -752,36 +761,37 @@
   }

   /**
-   * Merge and sort partitions keys for SPJ and optionally enable partition filtering.
+   * Merge, dedup and sort partitions keys for SPJ and optionally enable partition filtering.
    * Both sides must have matching partition expressions.
    * @param leftPartitionKeys left side partition keys
    * @param rightPartitionKeys right side partition keys
    * @param joinType join type for optional partition filtering
    * @keyOrdering ordering to sort partition keys
    * @return merged and sorted partition values
    */
-  def mergePartitions(
+  def mergeAndDedupPartitions(
       leftPartitionKeys: Seq[InternalRowComparableWrapper],
       rightPartitionKeys: Seq[InternalRowComparableWrapper],
       joinType: JoinType,
       keyOrdering: Ordering[InternalRowComparableWrapper]): Seq[InternalRowComparableWrapper] = {
     val merged = if (SQLConf.get.getConf(SQLConf.V2_BUCKETING_PARTITION_FILTER_ENABLED)) {
       joinType match {
-        case Inner => mergePartitionKeys(leftPartitionKeys, rightPartitionKeys, intersect = true)
-        case LeftOuter => leftPartitionKeys
-        case RightOuter => rightPartitionKeys
-        case _ => mergePartitionKeys(leftPartitionKeys, rightPartitionKeys)
+        case Inner =>
+          mergeAndDedupPartitionKeys(leftPartitionKeys, rightPartitionKeys, intersect = true)
+        case LeftOuter => leftPartitionKeys.distinct
+        case RightOuter => rightPartitionKeys.distinct
+        case _ => mergeAndDedupPartitionKeys(leftPartitionKeys, rightPartitionKeys)
       }
     } else {
-      mergePartitionKeys(leftPartitionKeys, rightPartitionKeys)
+      mergeAndDedupPartitionKeys(leftPartitionKeys, rightPartitionKeys)
     }

     // SPARK-41471: We keep to order of partitions to make sure the order of
     // partitions is deterministic in different case.
     merged.sorted(keyOrdering)
   }

-  private def mergePartitionKeys(
+  private def mergeAndDedupPartitionKeys(
       leftPartitionKeys: Seq[InternalRowComparableWrapper],
       rightPartitionKeys: Seq[InternalRowComparableWrapper],
       intersect: Boolean = false) = {
```

Member review comment on the `* @keyOrdering ordering to sort partition keys` doc line: Pre-existing, but since this doc block was updated — suggested change: `* @keyOrdering ordering to sort partition keys` → `* @param keyOrdering ordering to sort partition keys`.
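The rename makes the deduplication explicit: grouped partitions must not contain duplicate keys, and the outer-join shortcuts now call `.distinct` instead of returning the raw key sequences. A simplified model of the per-join-type behavior, using `Int` keys in place of `InternalRowComparableWrapper` and a plain boolean in place of the SQLConf lookup:

```scala
// Simplified sketch of mergeAndDedupPartitions' semantics, not the
// actual implementation.
def mergeAndDedup(
    left: Seq[Int],
    right: Seq[Int],
    joinType: String,
    partitionFilterEnabled: Boolean): Seq[Int] = {
  val merged = if (partitionFilterEnabled) {
    joinType match {
      // Inner join: only keys present on both sides can match.
      case "Inner" => left.intersect(right).distinct
      // Outer joins: the preserved side's keys are kept, deduplicated.
      case "LeftOuter" => left.distinct
      case "RightOuter" => right.distinct
      // Other join types: union of both sides' keys.
      case _ => (left ++ right).distinct
    }
  } else {
    (left ++ right).distinct
  }
  // SPARK-41471: sort so the partition order is deterministic.
  merged.sorted
}

// e.g. mergeAndDedup(Seq(50, 50, 51), Seq(51, 52), "Inner", true) == Seq(51)
```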
KeyGroupedPartitioningSuite.scala:
```diff
@@ -19,7 +19,7 @@ package org.apache.spark.sql.connector
 import java.sql.Timestamp
 import java.util.Collections

-import org.apache.spark.SparkConf
+import org.apache.spark.{SparkConf, SparkException}
 import org.apache.spark.sql.{DataFrame, ExplainSuiteHelper, Row}
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.{Literal, TransformExpression}
```
```diff
@@ -75,6 +75,20 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase with
     Column.create("dept_id", IntegerType),
     Column.create("data", StringType))

+  def withFunction[T](fn: UnboundFunction)(f: => T): T = {
+    val id = Identifier.of(Array.empty, fn.name())
+    val oldFn = Option.when(catalog.listFunctions(Array.empty).contains(id)) {
+      val fn = catalog.loadFunction(id)
+      catalog.dropFunction(id)
+      fn
+    }
+    catalog.createFunction(id, fn)
+    try f finally {
+      catalog.dropFunction(id)
+      oldFn.foreach(catalog.createFunction(id, _))
+    }
+  }
+
   test("clustered distribution: output partitioning should be KeyedPartitioning") {
     val partitions: Array[Transform] = Array(Expressions.years("ts"))

```
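The new `withFunction` helper registers an `UnboundFunction` in the test catalog for the duration of a block and then restores whatever function (if any) was previously registered under the same name, even when the block throws. The SPARK-56046 test further below uses it to swap in a function with an incompatible reducer:

```scala
// Temporarily override the catalog function, run the body, then restore.
withFunction(UnboundDaysFunctionWithIncompatibleResultTypeReducer) {
  // queries here resolve the function against the overriding implementation
}
```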
```diff
@@ -88,7 +102,7 @@ class KeyGroupedPartitioningSuite extends DistributionAndOrderingSuiteBase with
     var df = sql(s"SELECT count(*) FROM testcat.ns.$table GROUP BY ts")
     val catalystDistribution = physical.ClusteredDistribution(
       Seq(TransformExpression(YearsFunction, Seq(attr("ts")))))
-    val partitionKeys = Seq(50L, 51L, 52L).map(v => InternalRow.fromSeq(Seq(v)))
+    val partitionKeys = Seq(50, 51, 52).map(v => InternalRow.fromSeq(Seq(v)))

     checkQueryPlan(df, catalystDistribution,
       physical.KeyedPartitioning(catalystDistribution.clustering, partitionKeys))
```
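The expected partition keys change from `Long` to `Int` to match the result type of the years transform (commit e31b361, "fix expected ordering type of years transform"): the partition value of `years(ts)` is the year offset from the 1970 epoch. A quick worked check of where 50 comes from:

```scala
// Year offset from 1970 for the first expected partition value.
val key2020 = java.time.LocalDate.parse("2020-01-01").getYear - 1970
assert(key2020 == 50)  // hence the expected keys Seq(50, 51, 52)
```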
```diff
@@ -3385,4 +3399,83 @@
       checkKeywordsExistsInExplain(df, FormattedMode, formattedKeyword)
     }
   }
+
+  test("SPARK-56046: Reducers with same result types") {
+    val items_partitions = Array(days("arrive_time"))
+    createTable(items, itemsColumns, items_partitions)
+    sql(s"INSERT INTO testcat.ns.$items VALUES " +
+      s"(0, 'aa', 39.0, cast('2020-01-01' as timestamp)), " +
+      s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
+      s"(2, 'bb', 41.0, cast('2021-01-03' as timestamp)), " +
+      s"(3, 'bb', 42.0, cast('2021-01-04' as timestamp))")
+
+    val purchases_partitions = Array(years("time"))
+    createTable(purchases, purchasesColumns, purchases_partitions)
+    sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
+      s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
+      s"(5, 44.0, cast('2020-01-15' as timestamp)), " +
+      s"(7, 46.5, cast('2021-02-08' as timestamp))")
+
+    withSQLConf(
+      SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
+      SQLConf.V2_BUCKETING_ALLOW_COMPATIBLE_TRANSFORMS.key -> "true") {
+      Seq(
+        s"testcat.ns.$items i JOIN testcat.ns.$purchases p ON p.time = i.arrive_time",
+        s"testcat.ns.$purchases p JOIN testcat.ns.$items i ON i.arrive_time = p.time"
+      ).foreach { joinSting =>
+        val df = sql(
+          s"""
+             |${selectWithMergeJoinHint("i", "p")} id, item_id
+             |FROM $joinSting
+             |ORDER BY id, item_id
+             |""".stripMargin)
+
+        val shuffles = collectShuffles(df.queryExecution.executedPlan)
+        assert(shuffles.isEmpty, "should not add shuffle for both sides of the join")
+        val groupPartitions = collectGroupPartitions(df.queryExecution.executedPlan)
+        assert(groupPartitions.forall(_.outputPartitioning.numPartitions == 2))
+
+        checkAnswer(df, Seq(Row(0, 1), Row(1, 1)))
+      }
+    }
+  }
+
+  test("SPARK-56046: Reducers with different result types") {
+    withFunction(UnboundDaysFunctionWithIncompatibleResultTypeReducer) {
+      val items_partitions = Array(days("arrive_time"))
+      createTable(items, itemsColumns, items_partitions)
+      sql(s"INSERT INTO testcat.ns.$items VALUES " +
+        s"(0, 'aa', 39.0, cast('2020-01-01' as timestamp)), " +
+        s"(1, 'aa', 40.0, cast('2020-01-01' as timestamp)), " +
+        s"(2, 'bb', 41.0, cast('2021-01-03' as timestamp)), " +
+        s"(3, 'bb', 42.0, cast('2021-01-04' as timestamp))")
+
+      val purchases_partitions = Array(years("time"))
+      createTable(purchases, purchasesColumns, purchases_partitions)
+      sql(s"INSERT INTO testcat.ns.$purchases VALUES " +
+        s"(1, 42.0, cast('2020-01-01' as timestamp)), " +
+        s"(5, 44.0, cast('2020-01-15' as timestamp)), " +
+        s"(7, 46.5, cast('2021-02-08' as timestamp))")
+
+      withSQLConf(
+        SQLConf.V2_BUCKETING_PUSH_PART_VALUES_ENABLED.key -> "true",
+        SQLConf.V2_BUCKETING_ALLOW_COMPATIBLE_TRANSFORMS.key -> "true") {
+        Seq(
+          s"testcat.ns.$items i JOIN testcat.ns.$purchases p ON p.time = i.arrive_time",
+          s"testcat.ns.$purchases p JOIN testcat.ns.$items i ON i.arrive_time = p.time"
+        ).foreach { joinSting =>
+          val e = intercept[SparkException] {
+            sql(
+              s"""
+                 |${selectWithMergeJoinHint("i", "p")} id, item_id
+                 |FROM $joinSting
+                 |ORDER BY id, item_id
+                 |""".stripMargin).collect()
+          }
+          assert(e.getMessage.contains(
+            "Storage-partition join partition transforms produced incompatible reduced types"))
+        }
+      }
+    }
+  }
 }
```

Member review comment on `).foreach { joinSting =>`: typo: `joinString`.
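To see why the first test asserts `numPartitions == 2`: the items side is partitioned by `days(arrive_time)` and the purchases side by `years(time)`; reducing the days keys onto the years key space leaves both sides with exactly the partitions for 2020 and 2021. A worked check using the test's dates (illustrative only):

```scala
// Reduce the items' day-grained dates to year offsets from 1970 and compare
// them with the purchases side, which is already year-grained.
def yearsKey(date: String): Int =
  java.time.LocalDate.parse(date).getYear - 1970

val itemKeys = Seq("2020-01-01", "2020-01-01", "2021-01-03", "2021-01-04")
  .map(yearsKey).distinct                  // Seq(50, 51)
val purchaseKeys = Seq("2020-01-01", "2020-01-15", "2021-02-08")
  .map(yearsKey).distinct                  // Seq(50, 51)
assert(itemKeys == purchaseKeys)           // two grouped partitions
```

The second test registers `UnboundDaysFunctionWithIncompatibleResultTypeReducer` (defined elsewhere in the PR, outside this excerpt), whose reducer reports a result type that differs from the other side's, so planning fails with the new `storagePartitionJoinIncompatibleReducedTypesError`.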
Review comment (on a removed `.distinct` call): This `.distinct` was moved to `mergeAndDedupPartitions()`.