Skip to content

[SPARK-56030] [SQL] Normalize inner Project when below Project -> Filter#54859

Open
mihailoale-db wants to merge 1 commit intoapache:masterfrom
mihailoale-db:normalizeproject1
Open

[SPARK-56030] [SQL] Normalize inner Project when below Project -> Filter#54859
mihailoale-db wants to merge 1 commit intoapache:masterfrom
mihailoale-db:normalizeproject1

Conversation

@mihailoale-db
Copy link
Contributor

What changes were proposed in this pull request?

In this PR I propose to normalize inner Project when below Project -> Filter.

Why are the changes needed?

Right now

SELECT a.key, b.key AS key_b
FROM t1 b
FULL JOIN t2 a USING (key)
WHERE b.key IS NULL OR a.key IS NULL

throws LOGICAL_PLAN_COMPARISON_MISMATCH for analyzer dual-runs and normalizing would fix it.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added + existing tests.

Was this patch authored or co-authored using generative AI tooling?

Yes.

@mihailoale-db
Copy link
Contributor Author

@dtenedor PTAL when you find time. Thanks!

@mihailoale-db
Copy link
Contributor Author

@cloud-fan PTAL when you find time. Thanks!

case project @ Project(_, innerAggregate: Aggregate) =>
project.copy(child = normalizeAggregateListOrder(innerAggregate))

case project @ Project(_, filter @ Filter(_, innerProject: Project)) =>
Copy link
Contributor

@cloud-fan cloud-fan Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make it more general? we can turn transformUpWithSubqueries into a manual top-down recursion. When we hit a project, we set a bool flag projectListOrderSensitive to false when invoking the recursive method, and we do not set projectListOrderSensitive back to true for a whitelist of plan nodes: Filter, Sample, Sort, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about that. It's not just Filter, Sort, Sample and some small set of operators but all the operators that have override def output: Seq[Attribute] = child.output. I don't think that it's robust in a way that you can easily miss some and besides that if someone adds another operator there is 99% chance that they will forget to add it to the list. So I propose to go with this approach and think of a better solution in a background in order to avoid polluting NormalizePlan too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An allowlist solution is better than the current hardcoded single filter hack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants