fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY by lyne7-sc · Pull Request #22995 · apache/datafusion

lyne7-sc · 2026-06-17T13:00:08Z

Which issue does this PR close?

Closes #22994 (Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY).

Rationale for this change

Parquet bloom filter pruning can incorrectly prune decimal columns encoded as FIXED_LEN_BYTE_ARRAY.

Bloom filters are checked against the physical bytes stored in the Parquet file. For FIXED_LEN_BYTE_ARRAY, the byte width comes from the Parquet column descriptor's type_length. DataFusion was checking decimal literals using a fixed-width integer byte representation, which can differ from the file's fixed byte width. The mismatched bytes miss in the bloom filter (a false negative), so row groups that do contain the value can be pruned.

What changes are included in this PR?

Carry the Parquet column type_length together with the bloom filter metadata.
Use type_length when checking decimal literals against FIXED_LEN_BYTE_ARRAY bloom filters.
Fall back to conservative pruning behavior when the fixed byte length cannot be represented safely.
Add a regression test for fixed-length decimal bloom filter pruning.
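
To illustrate the width mismatch described above, here is a minimal sketch of encoding an unscaled decimal value as big-endian two's-complement bytes of exactly type_length bytes. The function name `decimal_to_fixed_len_bytes` and its `Option` return are assumptions made for this example, not the code in this PR. Returning `None` stands in for the "cannot be represented safely" case, where a caller would keep the row group rather than prune it.

```rust
/// Hypothetical helper (illustration only, not DataFusion's implementation).
/// Encodes `unscaled` as big-endian two's-complement bytes that are exactly
/// `type_length` bytes wide, or returns `None` if that width cannot hold it.
fn decimal_to_fixed_len_bytes(unscaled: i128, type_length: i32) -> Option<Vec<u8>> {
    // Reject zero or negative widths up front.
    let len = usize::try_from(type_length).ok().filter(|&n| n > 0)?;

    let full = unscaled.to_be_bytes(); // always 16 bytes for i128
    let sign_byte: u8 = if unscaled < 0 { 0xFF } else { 0x00 };

    if len >= full.len() {
        // Wider than 16 bytes: pad on the left with sign-extension bytes.
        let mut out = vec![sign_byte; len - full.len()];
        out.extend_from_slice(&full);
        return Some(out);
    }

    // Narrower than 16 bytes: keep only the low-order `len` bytes.
    let (dropped, kept) = full.split_at(full.len() - len);

    // Truncation is lossless only if every dropped byte is pure sign
    // extension and the kept bytes still carry the original sign bit.
    let sign_preserved = (kept[0] & 0x80 != 0) == (unscaled < 0);
    if dropped.iter().all(|&b| b == sign_byte) && sign_preserved {
        Some(kept.to_vec())
    } else {
        None
    }
}
```

With a 2-byte column, 500 encodes as `[0x01, 0xF4]` and -500 as `[0xFE, 0x0C]`. Neither matches the 16-byte form of the same value, which is why probing the bloom filter with a fixed 16-byte representation would miss.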

Are these changes tested?

Yes.

Are there any user-facing changes?

No API changes. This fixes incorrect query results when Parquet bloom filter pruning is enabled for fixed-length decimal columns.

kosiew

@lyne7-sc
Thanks for the fix. I left a couple of small suggestions, but nothing blocking from me.

kosiew · 2026-06-18T09:07:37Z

    }

+    #[tokio::test]
+    async fn test_row_group_bloom_filter_pruning_predicate_decimal128() {


Nice regression coverage for the fixed-width truncation path. It might be worth adding a negative decimal case as well, since Parquet fixed-len decimal bytes depend on two's-complement sign extension and truncation. For example, you could write row groups with negative values and assert that a predicate like decimal_col = -500 keeps only the matching row group.

kosiew · 2026-06-18T09:07:37Z

    /// * Parquet physical [`Type`] needed to evaluate  literals against the filter
-    column_sbbf: HashMap<String, (Sbbf, Type)>,
+    /// * Type length from the Parquet column descriptor
+    column_sbbf: HashMap<String, (Sbbf, Type, i32)>,


This tuple now carries a pretty important type_length contract. A small named struct, such as struct ColumnBloomFilter { sbbf: Sbbf, physical_type: Type, type_length: i32 }, could make the invariant clearer and help avoid accidentally mixing up tuple fields at call sites.
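
A rough sketch of what the named-struct suggestion could look like. `Sbbf` and `Type` below are local stand-ins so the example compiles on its own (the real types come from the parquet crate), and `fixed_len_width` is a hypothetical accessor invented for this illustration.

```rust
use std::collections::HashMap;

// Stand-in for the parquet crate's bloom filter type (illustration only).
#[derive(Debug)]
struct Sbbf;

// Stand-in for the Parquet physical type enum (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Type {
    Int32,
    Int64,
    FixedLenByteArray,
}

/// Named fields replace the positional `(Sbbf, Type, i32)` tuple.
#[derive(Debug)]
struct ColumnBloomFilter {
    sbbf: Sbbf,
    physical_type: Type,
    /// Byte width from the column descriptor; only meaningful for
    /// FIXED_LEN_BYTE_ARRAY columns.
    type_length: i32,
}

/// Hypothetical accessor: returns the fixed byte width for a column, but
/// only when the column is FIXED_LEN_BYTE_ARRAY with a positive length.
fn fixed_len_width(
    filters: &HashMap<String, ColumnBloomFilter>,
    column: &str,
) -> Option<i32> {
    let f = filters.get(column)?;
    (f.physical_type == Type::FixedLenByteArray && f.type_length > 0)
        .then_some(f.type_length)
}
```

Call sites then read `filter.type_length` instead of a positional `.2`, so a reordered field becomes a compile error rather than a silent mix-up.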

fix

d33aae6

github-actions bot added the datasource label (Changes to the datasource crate) on Jun 17, 2026

Merge branch 'main' into fix/bloom_filter_decimal

892f162

kosiew approved these changes Jun 18, 2026

lyne7-sc wants to merge 2 commits into apache:main from lyne7-sc:fix/bloom_filter_decimal
