feat(parquet): add content defined chunking for arrow writer #9450
kszucs wants to merge 9 commits into apache:main
Which issue does this PR close?
Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row shifts every subsequent page boundary, and nearly every byte must be re-uploaded to content-addressable storage (CAS) systems. CDC instead derives page boundaries from a rolling gearhash over the column values, so unchanged data produces identical pages across different writes, which reduces storage costs and upload times.
See more details in https://huggingface.co/blog/parquet-cdc
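To build intuition for the mechanism, here is a minimal, self-contained sketch of gearhash-based boundary selection. This is the general technique rather than this PR's chunker: the table seed, mask width, and size bounds below are illustrative assumptions.

```rust
/// Build a deterministic 256-entry gear table. Real implementations ship a
/// fixed table of random u64 values; an xorshift64 fill is used here only to
/// keep the example self-contained (the seed is an arbitrary assumption).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        *entry = state;
    }
    table
}

/// Return cut points in `data`. The hash is rolled byte by byte and a chunk
/// ends when the masked hash bits are all zero, clamped by the min/max chunk
/// sizes (mirroring the role of `min_chunk_size` / `max_chunk_size`).
fn cdc_boundaries(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut cuts = Vec::new();
    let (mut hash, mut len) = (0u64, 0usize);
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        len += 1;
        if (len >= min_size && (hash & mask) == 0) || len >= max_size {
            cuts.push(i + 1); // boundary falls after this byte
            hash = 0;
            len = 0;
        }
    }
    cuts
}

fn main() {
    // A mask with n low bits set cuts on average every 2^n bytes.
    let data: Vec<u8> = (0..100_000u32).flat_map(|v| v.to_le_bytes()).collect();
    let cuts = cdc_boundaries(&data, 4 * 1024, 64 * 1024, (1u64 << 13) - 1);
    println!("found {} boundaries", cuts.len());
}
```

Because the cut decision depends only on the bytes themselves, an edit early in the file perturbs at most the chunks around the edited region; the boundaries downstream re-synchronize and the remaining pages come out byte-identical, which is what makes CAS deduplication effective.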
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, into which I have already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better).
What changes are included in this PR?
- A content defined chunker under `parquet/src/column/chunker/`
- A new `CdcOptions` struct (`min_chunk_size`, `max_chunk_size`, `norm_level`) for configuring the `ArrowColumnWriter`
- A new `repeated_ancestor_def_level` field used for nested field values iteration
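For illustration, the options struct plausibly has the following shape. Only the struct name and field names come from the list above; the types, doc comments, and default values are assumptions:

```rust
/// Illustrative shape of the chunking options; not the PR's exact definition.
#[derive(Debug, Clone)]
pub struct CdcOptions {
    /// No page is cut before this many bytes have been consumed (assumed unit: bytes).
    pub min_chunk_size: usize,
    /// A page is always cut once it reaches this size.
    pub max_chunk_size: usize,
    /// Normalization level: higher values tighten the hash mask near the
    /// bounds so chunk sizes cluster closer to the average (FastCDC-style).
    pub norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        // Placeholder defaults, not the PR's actual values.
        Self {
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

fn main() {
    println!("{:?}", CdcOptions::default());
}
```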
Are these changes tested?
Yes: unit tests are located in `cdc.rs`, ported from the C++ implementation.
Are there any user-facing changes?
New experimental API, disabled by default; there is no behavior change for existing code:
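The example snippet did not survive extraction, so here is a hedged stand-in. Everything except the commented line uses the existing `parquet`/`arrow` crate API; the builder method for attaching `CdcOptions` to the writer properties is an assumption and is left commented out:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0i64..1_000_000));
    let batch = RecordBatch::try_new(schema.clone(), vec![col])?;

    // Hypothetical opt-in; the actual builder method name in this PR may
    // differ, and CDC stays disabled unless explicitly enabled.
    let props = WriterProperties::builder()
        // .set_content_defined_chunking(CdcOptions::default()) // assumed API
        .build();

    let file = std::fs::File::create("cdc.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Keeping the feature behind an explicit opt-in preserves byte-for-byte output stability for all existing writers.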