Add auto shredding field inference logic #18038

@voonhous

Task Description

What needs to be done:
Once #18037 is implemented, shredding columns will still have to be defined explicitly.

We need to add auto-inference logic for shredding so that shredding fields can be determined implicitly.

Refer to how Spark does it here:
https://github.com/apache/spark/blob/61d4581b7b2ada0efbf93f682c39c73b9252df79/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOutputWriterWithVariantShredding.scala#L31-L39

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/InferVariantShreddingSchema.scala
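
To make the idea concrete, here is a minimal sketch of what field inference could look like. It is not Spark's or Hudi's actual algorithm: plain Python dicts stand in for Variant values, and `infer_shredding_fields`, `min_presence`, and `max_fields` are hypothetical names chosen for illustration. The assumed rule is that a top-level field is shredded only if it appears in enough sampled rows and always carries the same scalar type.

```python
from collections import Counter, defaultdict


def type_name(value):
    # bool must be checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    # Nested, null, or other values stay unshredded in this sketch.
    return "variant"


def infer_shredding_fields(rows, min_presence=0.5, max_fields=3):
    """Pick top-level fields worth shredding from a sample of Variant-like objects.

    min_presence and max_fields are hypothetical knobs, not real Hudi configs.
    """
    presence = Counter()
    types = defaultdict(set)
    for row in rows:
        if not isinstance(row, dict):
            continue  # only object-shaped values contribute fields
        for key, value in row.items():
            presence[key] += 1
            types[key].add(type_name(value))

    total = len(rows)
    candidates = []
    for key, count in presence.items():
        frequent = total > 0 and count / total >= min_presence
        consistent = len(types[key]) == 1
        if frequent and consistent:
            (typ,) = types[key]
            if typ != "variant":
                candidates.append((key, typ, count))

    # Most frequent fields first; ties broken by name so the result is deterministic.
    candidates.sort(key=lambda c: (-c[2], c[0]))
    return {key: typ for key, typ, _ in candidates[:max_fields]}


sample = [
    {"id": 1, "name": "a", "tags": ["x"]},
    {"id": 2, "name": "b", "score": 1.5},
    {"id": 3, "name": 7},
    {"id": 4},
]
print(infer_shredding_fields(sample))
```

In this sample only `id` qualifies: `name` has conflicting types, while `tags` and `score` appear in too few rows. Everything not selected would remain in the unshredded Variant value.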

Why this task is needed:

Implementing this will allow us to explore the performance penalties from shredding jitter, i.e. the cost of reconciling differing physical schemas of the Variant columns at the file level during reading and writing.
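
As an illustration of why per-file inference leads to reconciliation work, the sketch below merges the shredded field maps of two files into a single read-side view. It is only an assumption about how reconciliation might behave: `reconcile_shredding_schemas` is a hypothetical helper, schemas are modelled as plain `field -> type` dicts, and a type conflict simply falls back to reading the field as unshredded `variant`.

```python
def reconcile_shredding_schemas(file_schemas):
    """Merge per-file shredded field maps into one read-side view.

    Hypothetical policy: a field keeps its type only if every file that
    shreds it agrees on that type; otherwise it is read as plain variant.
    """
    merged = {}
    for schema in file_schemas:
        for field, typ in schema.items():
            if field not in merged:
                merged[field] = typ
            elif merged[field] != typ:
                # Conflicting physical types across files: fall back to variant.
                merged[field] = "variant"
    return merged


# Two files written at different times may infer different shredding fields.
file_a = {"id": "long", "name": "string"}
file_b = {"id": "long", "name": "long", "score": "double"}

read_schema = reconcile_shredding_schemas([file_a, file_b])
print(read_schema)
```

Here `id` stays typed, `name` degrades to `variant` because the two files disagree, and `score` is carried over even though only one file shreds it. A real reader would also have to handle files where a field was never shredded at all.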

As such, performance considerations are not within the scope of this task. Instead, this task is a prerequisite for that performance work.

The scope of this PR is to enable inference logic for shredding and to ensure that reads and writes for CoW and MoR tables succeed without any issues.

Task Type

Code improvement/refactoring
