parquet: decoder-level dictionary filter pushdown for Parquet reader#9464

Draft
lyang24 wants to merge 1 commit into apache:main from lyang24:dict_decoder_predict_pushdown

Conversation

lyang24 (Contributor) commented Feb 23, 2026

Which issue does this PR close?

Rationale for this change

For columns that are dictionary encoded, the reader used to decode the dictionary-encoded data into a StringViewArray and only then compare the values against the filter.

What changes are included in this PR?

For a column that is 100% dictionary encoded, we decode the dictionary keys, apply the filter, and expand the nulls. Finally, we filter the values and materialize only those that survive the filter.
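
The steps above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: the function name filter_dict_column, the plain slices standing in for Arrow buffers, and the choice to treat null rows as not matching are all assumptions made for the example.

```rust
/// Hypothetical helper (not an arrow-rs API): filters a fully
/// dictionary-encoded string column without materializing every row.
///
/// - `dict`: the dictionary values
/// - `keys`: one dictionary key per NON-NULL row, in row order
/// - `validity`: one flag per row, `true` meaning the row is non-null
///
/// Returns the per-row selection mask and the surviving values only.
pub fn filter_dict_column<'a>(
    dict: &[&'a str],
    keys: &[usize],
    validity: &[bool],
    predicate: impl Fn(&str) -> bool,
) -> (Vec<bool>, Vec<&'a str>) {
    // Evaluate the predicate once per dictionary entry, not once per row.
    let matching_keys: Vec<bool> = dict.iter().map(|&v| predicate(v)).collect();

    // Apply the filter to the decoded keys (non-null rows only).
    let dense_mask: Vec<bool> = keys.iter().map(|&k| matching_keys[k]).collect();

    // Expand nulls: a null row consumes no key and is assumed not to match.
    let mut dense = dense_mask.iter();
    let mask: Vec<bool> = validity
        .iter()
        .map(|&valid| valid && *dense.next().expect("one key per non-null row"))
        .collect();

    // Materialize only the values whose rows survived the filter.
    let survivors: Vec<&'a str> = keys
        .iter()
        .zip(&dense_mask)
        .filter_map(|(&k, &keep)| if keep { Some(dict[k]) } else { None })
        .collect();

    (mask, survivors)
}
```

With dict = ["apple", "banana", "cherry"], keys = [0, 1, 1, 2], a null in the second of five rows, and a predicate that keeps values starting with 'b', the sketch selects rows 2 and 3 and materializes only the two "banana" values.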

Are these changes tested?

Existing tests should pass, and additional tests have been added.

Are there any user-facing changes?

Yes. This adds a new API, with_dict_pushdown, to opt in to predicate pushdown for dictionary-encoded columns.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Feb 23, 2026
@lyang24 lyang24 changed the title perf: decoder-level dictionary filter pushdown for Parquet reader parquet: decoder-level dictionary filter pushdown for Parquet reader Feb 23, 2026
@lyang24 lyang24 force-pushed the dict_decoder_predict_pushdown branch 2 times, most recently from 0116686 to 2629604 Compare February 23, 2026 07:30
Push predicate evaluation into the decoder for dictionary-encoded
BYTE_ARRAY columns. Instead of decoding all rows into StringViewArray
and then filtering, this evaluates the predicate once on the small
dictionary (~N unique values), then maps integer keys to booleans
via a simple lookup — no intermediate arrays are created for data rows.

Two-phase approach:
- Phase 1: decode dictionary page, evaluate predicate, produce
  matching_keys: Vec<bool> per row group
- Phase 2: DictFilterDecoder reads RLE-encoded integer keys and
  maps each key to matching_keys[key], producing a BooleanArray
  that feeds directly into RowSelection

Adds ArrowPredicate::use_dictionary_encoding() opt-in flag and
falls back to the normal path for multi-column predicates,
non-BYTE_ARRAY types, nested columns, or PLAIN-encoded pages.
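
The two phases described in this commit message might look roughly like the sketch below. It uses only the standard library and makes several simplifying assumptions: KeyRun is a hypothetical stand-in for Parquet's RLE/bit-packed hybrid key encoding, the output is a Vec<bool> where the commit message describes a BooleanArray, and build_matching_keys and filter_key_runs are illustrative names, not functions from this PR.

```rust
/// Hypothetical, simplified run of RLE-encoded dictionary keys:
/// `len` consecutive rows that all share the same `key`.
#[derive(Debug, Clone, Copy)]
pub struct KeyRun {
    pub key: usize,
    pub len: usize,
}

/// Phase 1: evaluate the predicate once per dictionary entry.
/// The result is indexed by dictionary key.
pub fn build_matching_keys(dict: &[&str], predicate: impl Fn(&str) -> bool) -> Vec<bool> {
    dict.iter().map(|&v| predicate(v)).collect()
}

/// Phase 2: map each run of keys to booleans with a single lookup
/// per run. No string values are decoded for the data rows.
pub fn filter_key_runs(runs: &[KeyRun], matching_keys: &[bool]) -> Vec<bool> {
    let total: usize = runs.iter().map(|r| r.len).sum();
    let mut selected = Vec::with_capacity(total);
    for run in runs {
        let keep = matching_keys[run.key];
        selected.extend(std::iter::repeat(keep).take(run.len));
    }
    selected
}
```

In this sketch the predicate cost scales with the number of dictionary entries, and the per-row work is a table lookup, which is the property the commit message relies on.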
@lyang24 lyang24 force-pushed the dict_decoder_predict_pushdown branch from 2629604 to a4545dd Compare February 23, 2026 07:32

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Push down (scalar) filters down to Parquet encoders

1 participant