Skip to content

Spark: Validate Z-order rewrite does not conflict with internal ICEZVALUE column name#15706

Open
YanivZalach wants to merge 1 commit intoapache:mainfrom
YanivZalach:fix/zorder-icezvalue-column-collision
Open

Spark: Validate Z-order rewrite does not conflict with internal ICEZVALUE column name#15706
YanivZalach wants to merge 1 commit intoapache:mainfrom
YanivZalach:fix/zorder-icezvalue-column-collision

Conversation

@YanivZalach
Copy link

@YanivZalach YanivZalach commented Mar 20, 2026

Tables with a column named ICEZVALUE fail with a misleading error during
Z-order rewrite because SparkZOrderFileRewriteRunner internally uses a
column with that name.

Adds an early Preconditions.checkArgument in SparkZOrderFileRewriteRunner
before any DataFrame operation, throwing a clear exception
if the table schema already contains a column named ICEZVALUE.

Closes #15708

@YanivZalach YanivZalach force-pushed the fix/zorder-icezvalue-column-collision branch 4 times, most recently from a6adec9 to d4e22e5 Compare March 20, 2026 22:33
@YanivZalach YanivZalach force-pushed the fix/zorder-icezvalue-column-collision branch from d4e22e5 to 48cabea Compare March 21, 2026 00:38
@wypoon
Copy link
Contributor

wypoon commented Mar 21, 2026

I think that it would be better to use a configurable column name than a hardcoded "ICEZVALUE" in SparkZOrderFileRewriteRunner.
We could add an option to https://iceberg.apache.org/docs/latest/spark-procedures/#options-for-sort-strategy-with-zorder-sort_order, for example, for configuring this column name. What do you think, @RussellSpitzer?
We should still throw the error as proposed here if the column coincides with an existing column in the table.

@wypoon
Copy link
Contributor

wypoon commented Mar 21, 2026

For ease of review, it is better to make the changes only to the latest supported version of Spark, and then backport the changes to earlier versions once the PR is merged.

@RussellSpitzer
Copy link
Member

I think that it would be better to use a configurable column name than a hardcoded "ICEZVALUE" in SparkZOrderFileRewriteRunner.

We could add an option to https://iceberg.apache.org/docs/latest/spark-procedures/#options-for-sort-strategy-with-zorder-sort_order, for example, for configuring this column name. What do you think, @RussellSpitzer?

We should still throw the error as proposed here if the column coincides with an existing column in the table.

I really don't think we need customization here, is this really that likely to be conflicting?

If we are really worried I would just randomly change the name if the column exists. My hesitation on all of this is that it doesn't seem like a very likely situation so while it is true it could happen is it worth adding code for the very rare chance it does?

So in terms of solutions I would consider

  1. Randomly assign the column name (gen_zorder_xx) where if for some reason that exists we randomize the xx to something else
  2. Just fail if the chosen column name is already in use
  3. Let users customize it

The order here is

  1. No user intervention needed, we hide the problem
  2. We expose that it's a problem
  3. We expose it's a problem and have to support a custom new parameter just for this special case

For all of these we still have the question of, "is it really worth adding more code to the project to avoid this"

@YanivZalach
Copy link
Author

I agree that option 2 (fail with a clear error) is the right tradeoff here.
The name ICEZVALUE is unlikely to conflict in practice, but when it does,
the current behavior is silent data modification, followed by a completely
misleading error that gives the user no indication of what went wrong or how
to fix it, and when a rewrite did not happen because of the thresholds, no error at all.

The value of this PR is not preventing a common case — it's making a rare but
confusing failure mode diagnosable.

@RussellSpitzer
Copy link
Member

RussellSpitzer commented Mar 21, 2026

Explain why not just auto fixing it or choosing a unique name for the column isn't better.

Express your opinion in a haiku

@YanivZalach
Copy link
Author

YanivZalach commented Mar 21, 2026

I agree with you, and see a lot of value in consistent errors myself.

The haiku:

Fixed names shall stay true,

Clear errors guide the user,

Keep the system clean

@RussellSpitzer
Copy link
Member

Iambic Pentameter, would also be clearer for end users.

@RussellSpitzer
Copy link
Member

More seriously, I think the clearer message is probably fine , I think in reality there is almost zero chance this code path is ever executed, but if allow any schema is set and the user defined icezvalue is optional, this could silently erase that value. I doubt this will ever happen but I'll accept a precondition and test. It shouldn't add that much burden

@YanivZalach YanivZalach requested a review from wypoon March 21, 2026 18:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Spark: Z-order rewrite fails with misleading error when table has a column named ICEZVALUE

3 participants