[SPARK-37019][SQL] Add codegen support to array transform by Kimahriman · Pull Request #34294 · apache/spark

Kimahriman · 2021-10-15T11:51:00Z

What changes were proposed in this pull request?

This PR adds codegen support to ArrayTransform. This is my first time playing around with codegen, so definitely looking for any feedback. I ran into several issues along the way which you'll see in some checks I had to add. Specifically:

I initially added lambda variable tracking to the codegen context, to make sure a function can't be split out while there is any active lambda variables and using local project values for the lambda variables. But this lead to a function to large issue when testing it in production code so I changed it to just use everything off of the reference to the atomic variable
Made sure lambda functions themselves can never be considered for subexpression elimination

Questions I have:

Does it make sense to only use the reference to the atomic variable or should I try to add a mutable state or something for less function calls in the happy non-fallback case?
Any other weird corner cases I'm not thinking of?

Why are the changes needed?

To improve performance of transform operations, letting the children be codegen'd and participate in WholeStageCodegen

Does this PR introduce any user-facing change?

No, only performance improvements.

How was this patch tested?

Existing unit tests, let me know if there's other codegen-specific unit tests I should add.

Kimahriman · 2021-10-15T11:55:53Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/higherOrderFunctions.scala

    result
  }

+  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {


Some of this can probably be abstracted out into the parent traits, but I figured that will be easier to do when implementing a second function

AmplabJenkins · 2021-10-15T12:22:20Z

Can one of the admins verify this patch?

Kimahriman · 2021-10-22T12:27:53Z

I have array filter working already too, just waiting for some feedback on this.

Kimahriman · 2021-10-22T12:28:04Z

Codegen example for df.selectExpr("transform(i, (x, i) -> x + i)"):

/* 038 */       if (!localtablescan_isNull_0) {
/* 039 */         final int project_numElements_0 = localtablescan_value_0.numElements();
/* 040 */
/* 041 */         ArrayData project_arrayData_0 = ArrayData.allocateArrayData(
/* 042 */           4, project_numElements_0, " transform failed.");
/* 043 */
/* 044 */         for (int project_i_0 = 0; project_i_0 < project_numElements_0; project_i_0++) {
/* 045 */           int lambda_x_6 = localtablescan_value_0.getInt(project_i_0);
/* 046 */           boolean lambda_x_6_isNull = localtablescan_value_0.isNullAt(project_i_0);
/* 047 */           ((java.util.concurrent.atomic.AtomicReference) references[1] /* x_6 */).set(lambda_x_6);
/* 048 */
/* 049 */           int lambda_i_7 = project_i_0;
/* 050 */           ((java.util.concurrent.atomic.AtomicReference) references[2] /* i_7 */).set(lambda_i_7);
/* 051 */
/* 052 */           boolean project_isNull_3 = true;
/* 053 */           int project_value_3 = -1;
/* 054 */
/* 055 */           if (!lambda_x_6_isNull) {
/* 056 */             project_isNull_3 = false; // resultCode could change nullability.
/* 057 */
/* 058 */             project_value_3 = lambda_x_6 + lambda_i_7;
/* 059 */
/* 060 */           }
/* 061 */           boolean project_isNull_2 = project_isNull_3;
/* 062 */           int project_value_2 = project_value_3;
/* 063 */
/* 064 */           if (project_isNull_2) {
/* 065 */             project_arrayData_0.setNullAt(project_i_0);
/* 066 */           } else {
/* 067 */             project_arrayData_0.setInt(project_i_0, project_value_2);
/* 068 */           }
/* 069 */
/* 070 */         }
/* 071 */         project_value_0 = project_arrayData_0;
/* 072 */
/* 073 */       }

Kimahriman · 2021-11-01T23:33:08Z

Already ran into an issue testing this out in a real use case where making it non-splittable caused a function to grow to big and revert back to interpreted, so I changed to just always use the atomic variable reference to read the value so it can be splittable. New sample codegen:

/* 039 */         final int project_numElements_0 = localtablescan_value_0.numElements();
/* 040 */
/* 041 */         ArrayData project_arrayData_0 = ArrayData.allocateArrayData(
/* 042 */           4, project_numElements_0, " transform failed.");
/* 043 */
/* 044 */         for (int project_i_0 = 0; project_i_0 < project_numElements_0; project_i_0++) {
/* 045 */           if (localtablescan_value_0.isNullAt(project_i_0)) {
/* 046 */             ((java.util.concurrent.atomic.AtomicReference) references[2] /* x_229 */).set(null);
/* 047 */           } else {
/* 048 */             ((java.util.concurrent.atomic.AtomicReference) references[2] /* x_229 */).set(localtablescan_value_0.getInt(project_i_0));
/* 049 */           }
/* 050 */
/* 051 */           boolean project_isNull_3 = true;
/* 052 */           int project_value_3 = -1;
/* 053 */           Object project_tmpAtomic_0 = ((java.util.concurrent.atomic.AtomicReference) references[1] /* x_229 */).get();
/* 054 */           int project_value_4 = -1;
/* 055 */           boolean project_isNull_4 = project_tmpAtomic_0 == null;
/* 056 */           if (!project_isNull_4) {
/* 057 */             project_value_4 = (Integer)project_tmpAtomic_0;
/* 058 */           }
/* 059 */           if (!project_isNull_4) {
/* 060 */             project_isNull_3 = false; // resultCode could change nullability.
/* 061 */
/* 062 */             project_value_3 = project_value_4 + 1;
/* 063 */
/* 064 */           }
/* 065 */           boolean project_isNull_2 = project_isNull_3;
/* 066 */           int project_value_2 = project_value_3;
/* 067 */
/* 068 */           if (project_isNull_2) {
/* 069 */             project_arrayData_0.setNullAt(project_i_0);
/* 070 */           } else {
/* 071 */             project_arrayData_0.setInt(project_i_0, project_value_2);
/* 072 */           }
/* 073 */
/* 074 */         }
/* 075 */         project_value_0 = project_arrayData_0;
/* 076 */
/* 077 */       }

Kimahriman · 2021-11-01T23:34:14Z

ping @viirya @HyukjinKwon since you two usually look at my PRs or might know the right other people to ping for thoughts :)

Kimahriman · 2021-11-11T13:55:25Z

Closing in favor of #34558

Add codegen support to array transform

386fc9c

github-actions bot added the SQL label Oct 15, 2021

Kimahriman commented Oct 15, 2021

View reviewed changes

Always use reference to atomic var

323a1fa

Kimahriman closed this Nov 11, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-37019][SQL] Add codegen support to array transform#34294

[SPARK-37019][SQL] Add codegen support to array transform#34294
Kimahriman wants to merge 2 commits intoapache:masterfrom
Kimahriman:array-transform-codegen

Kimahriman commented Oct 15, 2021 •

edited

Loading

Uh oh!

Kimahriman Oct 15, 2021

Uh oh!

AmplabJenkins commented Oct 15, 2021

Uh oh!

Kimahriman commented Oct 22, 2021

Uh oh!

Kimahriman commented Oct 22, 2021 •

edited

Loading

Uh oh!

Kimahriman commented Nov 1, 2021

Uh oh!

Kimahriman commented Nov 1, 2021

Uh oh!

Kimahriman commented Nov 11, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Kimahriman commented Oct 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Kimahriman Oct 15, 2021

Choose a reason for hiding this comment

Uh oh!

AmplabJenkins commented Oct 15, 2021

Uh oh!

Kimahriman commented Oct 22, 2021

Uh oh!

Kimahriman commented Oct 22, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Kimahriman commented Nov 1, 2021

Uh oh!

Kimahriman commented Nov 1, 2021

Uh oh!

Kimahriman commented Nov 11, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Kimahriman commented Oct 15, 2021 •

edited

Loading

Kimahriman commented Oct 22, 2021 •

edited

Loading