Spark: Validate Z-order rewrite does not conflict with internal ICEZVALUE column name#15706
Spark: Validate Z-order rewrite does not conflict with internal ICEZVALUE column name#15706YanivZalach wants to merge 1 commit intoapache:mainfrom
Conversation
a6adec9 to
d4e22e5
Compare
d4e22e5 to
48cabea
Compare
spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteDataFilesAction.java
Outdated
Show resolved
Hide resolved
|
I think that it would be better to use a configurable column name than a hardcoded "ICEZVALUE" in |
|
For ease of review, it is better to make the changes only to the latest supported version of Spark, and then backport the changes to earlier versions once the PR is merged. |
I really don't think we need customization here, is this really that likely to be conflicting? If we are really worried I would just randomly change the name if the column exists. My hesitation on all of this is that it doesn't seem like a very likely situation so while it is true it could happen is it worth adding code for the very rare chance it does? So in terms of solutions I would consider
The order here is
For all of these we still have the question of, "is it really worth adding more code to the project to avoid this" |
|
I agree that option 2 (fail with a clear error) is the right tradeoff here. The value of this PR is not preventing a common case — it's making a rare but |
|
Explain why not just auto fixing it or choosing a unique name for the column isn't better. Express your opinion in a haiku |
|
I agree with you, and see a lot of value in consistent errors myself. The haiku: Fixed names shall stay true, Clear errors guide the user, Keep the system clean |
|
Iambic Pentameter, would also be clearer for end users. |
|
More seriously, I think the clearer message is probably fine , I think in reality there is almost zero chance this code path is ever executed, but if allow any schema is set and the user defined icezvalue is optional, this could silently erase that value. I doubt this will ever happen but I'll accept a precondition and test. It shouldn't add that much burden |
Tables with a column named
ICEZVALUEfail with a misleading error duringZ-order rewrite because
SparkZOrderFileRewriteRunnerinternally uses acolumn with that name.
Adds an early
Preconditions.checkArgumentinSparkZOrderFileRewriteRunnerbefore any DataFrame operation, throwing a clear exception
if the table schema already contains a column named
ICEZVALUE.Closes #15708