Skip to content

fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY#22995

Open
lyne7-sc wants to merge 2 commits into
apache:mainfrom
lyne7-sc:fix/bloom_filter_decimal
Open

fix: Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY#22995
lyne7-sc wants to merge 2 commits into
apache:mainfrom
lyne7-sc:fix/bloom_filter_decimal

Conversation

@lyne7-sc

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

Parquet bloom filter pruning can incorrectly prune decimal columns encoded as FIXED_LEN_BYTE_ARRAY.

Bloom filters are checked against the physical bytes stored in the Parquet file. For FIXED_LEN_BYTE_ARRAY, the byte width comes from the Parquet column descriptor's type_length. DataFusion was checking decimal literals using a fixed-width integer byte representation, which can differ from thefile's fixed byte width and cause false negatives.

What changes are included in this PR?

  • Carry the Parquet column type_length together with the bloom filter metadata.
  • Use type_length when checking decimal literals against FIXED_LEN_BYTE_ARRAY bloom filters.
  • Fall back to conservative pruning behavior when the fixed byte length cannot be represented safely.
  • Add a regression test for fixed-length decimal bloom filter pruning.

Are these changes tested?

Yes.

Are there any user-facing changes?

No API changes. This fixes incorrect query results when Parquet bloom filter pruning is enabled for fixed-length decimal columns.

@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jun 17, 2026

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyne7-sc
Thanks for the fix. I left a couple of small suggestions, but nothing blocking from me.

}

#[tokio::test]
async fn test_row_group_bloom_filter_pruning_predicate_decimal128() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice regression coverage for the fixed-width truncation path. It might be worth adding a negative decimal case as well, since Parquet fixed-len decimal bytes depend on two's-complement sign extension and truncation. For example, you could write row groups with negative values and assert that a predicate like decimal_col = -500 keeps only the matching row group.

/// * Parquet physical [`Type`] needed to evaluate literals against the filter
column_sbbf: HashMap<String, (Sbbf, Type)>,
/// * Type length from the Parquet column descriptor
column_sbbf: HashMap<String, (Sbbf, Type, i32)>,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tuple now carries a pretty important type_length contract. A small named struct, such as struct ColumnBloomFilter { sbbf: Sbbf, physical_type: Type, type_length: i32 }, could make the invariant clearer and help avoid accidentally mixing up tuple fields at call sites.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet bloom filter pruning can incorrectly filter decimals encoded as FIXED_LEN_BYTE_ARRAY

2 participants