feat(parquet): add content defined chunking for arrow writer #9450
kszucs wants to merge 9 commits into apache:main
Which issue does this PR close?
Rationale for this change
Rust implementation of apache/arrow#45360
Traditional Parquet writing splits data pages at fixed sizes, so a single inserted or deleted row shifts every subsequent page boundary, and nearly every byte must be re-uploaded to content-addressable storage (CAS) systems. CDC instead derives page boundaries from a rolling gearhash over the column values, so unchanged data produces identical pages across different writes, which reduces storage costs and upload times.
See more details in https://huggingface.co/blog/parquet-cdc
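To build intuition for the mechanism, here is a minimal, self-contained sketch of gearhash-based boundary selection. This is the general technique rather than this PR's chunker: the table seed, mask width, and size bounds below are illustrative assumptions.

```rust
/// Build a deterministic 256-entry gear table. Real implementations ship a
/// fixed table of random u64 values; an xorshift64 fill is used here only to
/// keep the example self-contained (the seed is an arbitrary assumption).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for entry in table.iter_mut() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        *entry = state;
    }
    table
}

/// Return cut points in `data`. The hash is rolled byte by byte and a chunk
/// ends when the masked hash bits are all zero, clamped by the min/max chunk
/// sizes (mirroring the role of `min_chunk_size` / `max_chunk_size`).
fn cdc_boundaries(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<usize> {
    let table = gear_table();
    let mut cuts = Vec::new();
    let (mut hash, mut len) = (0u64, 0usize);
    for (i, &byte) in data.iter().enumerate() {
        hash = (hash << 1).wrapping_add(table[byte as usize]);
        len += 1;
        if (len >= min_size && (hash & mask) == 0) || len >= max_size {
            cuts.push(i + 1); // boundary falls after this byte
            hash = 0;
            len = 0;
        }
    }
    cuts
}

fn main() {
    // A mask with n low bits set cuts on average every 2^n bytes.
    let data: Vec<u8> = (0..100_000u32).flat_map(|v| v.to_le_bytes()).collect();
    let cuts = cdc_boundaries(&data, 4 * 1024, 64 * 1024, (1u64 << 13) - 1);
    println!("found {} boundaries", cuts.len());
}
```

Because the cut decision depends only on the bytes themselves, an edit early in the file perturbs at most the chunks around the edited region; the boundaries downstream re-synchronize and the remaining pages come out byte-identical, which is what makes CAS deduplication effective.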
The original C++ implementation: apache/arrow#45360
Evaluation tool: https://github.com/huggingface/dataset-dedupe-estimator, into which I have already integrated this PR to verify that deduplication effectiveness is on par with parquet-cpp (lower is better).
What changes are included in this PR?
- A content defined chunker under `parquet/src/column/chunker/`
- A new `CdcOptions` struct (`min_chunk_size`, `max_chunk_size`, `norm_level`) for configuring the `ArrowColumnWriter`
- A new `repeated_ancestor_def_level` field used for nested field values iteration
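For illustration, the options struct plausibly has the following shape. Only the struct name and field names come from the list above; the types, doc comments, and default values are assumptions:

```rust
/// Illustrative shape of the chunking options; not the PR's exact definition.
#[derive(Debug, Clone)]
pub struct CdcOptions {
    /// No page is cut before this many bytes have been consumed (assumed unit: bytes).
    pub min_chunk_size: usize,
    /// A page is always cut once it reaches this size.
    pub max_chunk_size: usize,
    /// Normalization level: higher values tighten the hash mask near the
    /// bounds so chunk sizes cluster closer to the average (FastCDC-style).
    pub norm_level: i32,
}

impl Default for CdcOptions {
    fn default() -> Self {
        // Placeholder defaults, not the PR's actual values.
        Self {
            min_chunk_size: 256 * 1024,
            max_chunk_size: 1024 * 1024,
            norm_level: 0,
        }
    }
}

fn main() {
    println!("{:?}", CdcOptions::default());
}
```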
Are these changes tested?
Yes: unit tests are located in `cdc.rs`, ported from the C++ implementation.
Are there any user-facing changes?
New experimental API, disabled by default; there is no behavior change for existing code:
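The example snippet did not survive extraction, so here is a hedged stand-in. Everything except the commented line uses the existing `parquet`/`arrow` crate API; the builder method for attaching `CdcOptions` to the writer properties is an assumption and is left commented out:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let col: ArrayRef = Arc::new(Int64Array::from_iter_values(0i64..1_000_000));
    let batch = RecordBatch::try_new(schema.clone(), vec![col])?;

    // Hypothetical opt-in; the actual builder method name in this PR may
    // differ, and CDC stays disabled unless explicitly enabled.
    let props = WriterProperties::builder()
        // .set_content_defined_chunking(CdcOptions::default()) // assumed API
        .build();

    let file = std::fs::File::create("cdc.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Keeping the feature behind an explicit opt-in preserves byte-for-byte output stability for all existing writers.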