Task Description
What needs to be done:
As of now, and assuming #18037 is implemented, shredding columns must be defined explicitly.
We need to add auto-inference logic for shredding so that the fields to shred can be determined implicitly.
Refer to how Spark does it here:
https://github.com/apache/spark/blob/61d4581b7b2ada0efbf93f682c39c73b9252df79/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOutputWriterWithVariantShredding.scala#L31-L39
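The exact heuristic is an open design point. As a rough illustration only (not Spark's approach and not an existing API in this project), one could sample Variant values on the write path and keep a typed shredded column only for fields whose type is consistent across the sample. Everything below (`ShreddedType`, `VariantValue`, `inferShreddingSchema`) is hypothetical:

```scala
// Hypothetical sketch of shredding-schema inference for a Variant column.
// Names and types here are illustrative, not Spark or project APIs.

sealed trait ShreddedType
case object ShreddedLong    extends ShreddedType
case object ShreddedDouble  extends ShreddedType
case object ShreddedString  extends ShreddedType
case object ShreddedBoolean extends ShreddedType
// Fields whose type conflicts across sampled rows stay in the untyped residual.
case object Residual        extends ShreddedType

object ShreddingInference {
  // A sampled Variant value, modeled here as top-level field name -> value.
  type VariantValue = Map[String, Any]

  private def typeOf(v: Any): ShreddedType = v match {
    case _: Long | _: Int     => ShreddedLong
    case _: Double | _: Float => ShreddedDouble
    case _: String            => ShreddedString
    case _: Boolean           => ShreddedBoolean
    case _                    => Residual // nested objects/arrays left to the residual in this sketch
  }

  /** Infers a per-field shredded type from a sample of Variant values.
    * A field gets a typed shredded column only if every sampled occurrence
    * agrees on the type; otherwise it falls back to the residual. */
  def inferShreddingSchema(sample: Seq[VariantValue]): Map[String, ShreddedType] =
    sample
      .flatMap(_.toSeq)
      .groupBy { case (field, _) => field }
      .map { case (field, occurrences) =>
        val types = occurrences.map { case (_, value) => typeOf(value) }.distinct
        field -> (if (types.size == 1) types.head else Residual)
      }
}

object ShreddingInferenceExample extends App {
  import ShreddingInference._
  // Both rows agree on "id" and "name", but "score" is a long in one row and a
  // string in the other, so it falls back to the residual.
  val sample: Seq[VariantValue] = Seq(
    Map("id" -> 1L, "name" -> "a", "score" -> 10L),
    Map("id" -> 2L, "name" -> "b", "score" -> "n/a"))
  // e.g. Map(id -> ShreddedLong, name -> ShreddedString, score -> Residual) (ordering may vary)
  println(inferShreddingSchema(sample))
}
```

How large the sample should be (first row, first row group, or the whole batch) and how nested fields are handled are tuning decisions outside this sketch.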
Why this task is needed:
Implementing this will let us explore the performance penalty of shredding jitter, i.e., reconciling differing physical schemas of the Variant columns at the file level during reading and writing.
Performance considerations are therefore not within the scope of this task; rather, this task is a prerequisite for that performance work.
The scope of this PR is to enable inference logic for shredding and to ensure that reads and writes for CoW + MoR tables succeed without issues.
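On the read side, files written at different times may carry different inferred shredding schemas for the same Variant column. A minimal sketch of reconciling them into one read-time schema, reusing the hypothetical `ShreddedType`/`Residual` from the sketch above (`reconcileShreddingSchemas` is likewise illustrative, not an existing API):

```scala
object ShreddingReconciliation {
  /** Merges the shredding schemas inferred for individual data files into a
    * single read-time schema. A field keeps its typed shredded column only if
    * every file that shredded it agrees on the type; otherwise the reader must
    * fall back to the untyped residual for that field. */
  def reconcileShreddingSchemas(
      perFile: Seq[Map[String, ShreddedType]]): Map[String, ShreddedType] =
    perFile
      .flatMap(_.toSeq)
      .groupBy { case (field, _) => field }
      .map { case (field, occurrences) =>
        val types = occurrences.map { case (_, t) => t }.distinct
        field -> (if (types.size == 1) types.head else Residual)
      }
}
```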
Task Type
Code improvement/refactoring
Related Issues
Parent feature issue: (if applicable)
Related issues:
NOTE: Use Relationships button to add parent/blocking issues after issue is created.