IN LIST: add two-stage filter for Utf8 and LargeUtf8 by geoffreyclaude · Pull Request #23017 · apache/datafusion

geoffreyclaude · 2026-06-18T07:32:47Z

Which issue does this PR close?

Part of Further improve performance of IN list evaluation #19241.
Stacked on IN LIST: add string-view filters for Utf8View and BinaryView #23016.
Extracted from Optimize IN performance with specialized implementations #19390.

Rationale for this change

#23016 optimizes Utf8View and BinaryView, where Arrow already stores length and prefix information in each view. Regular Utf8 and LargeUtf8 arrays do not have that same view layout, but short strings can still use the same basic idea.

For an all-short constant list, each string can be encoded into one fixed-size value containing:

string length + first bytes of the string

For example, the list ('cat', 'dog') can be pre-encoded once. Then each input string is encoded the same way and checked with the direct-probe lookup from #23015.

This is only used when every string in the constant list is short. If the list contains long strings, DataFusion keeps using the generic fallback path because encoding plus verification was not a win for that case.
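The all-short path described above can be sketched as follows. This is a minimal illustration, not the PR's code: `encode_short`, `build_short_set`, `in_list`, the 12-byte cutoff, and the use of `u128` with a standard `HashSet` in place of the direct-probe lookup from #23015 are all assumptions made for the example.

```rust
use std::collections::HashSet;

/// Longest string that fits next to a 4-byte length in one 16-byte value.
/// The cutoff is an assumption for this sketch.
const MAX_SHORT_LEN: usize = 12;

/// Packs a short string as its length (first 4 bytes) followed by its bytes,
/// zero-padded. Returns `None` when the string is too long to encode losslessly.
fn encode_short(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_SHORT_LEN {
        return None;
    }
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    buf[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(buf))
}

/// Pre-encodes the constant list once. `None` means at least one entry is
/// long, so the caller should stay on the generic fallback path.
fn build_short_set(list: &[&str]) -> Option<HashSet<u128>> {
    list.iter().map(|s| encode_short(s)).collect()
}

/// Encodes the input the same way and probes the set. An input that is too
/// long to encode cannot equal any entry of an all-short list.
fn in_list(set: &HashSet<u128>, input: &str) -> bool {
    encode_short(input).map_or(false, |key| set.contains(&key))
}
```

Because the length is part of the key, two short strings that differ only in trailing zero bytes still map to different values, so no second verification step is needed on this path.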

What changes are included in this PR?

Adds Utf8TwoStageFilter for regular Utf8 and LargeUtf8 arrays.
Encodes valid haystack strings as length plus prefix bytes for fast lookup.
Routes all-short Utf8 / LargeUtf8 constant lists to the encoded direct-probe path.
Keeps exact long-string verification inside the filter for correctness if that filter is used directly.
Keeps long-string constant lists on the generic fallback path.
Adds focused coverage for slices, nulls, and long strings with matching prefixes but different suffixes.

Are these changes tested?

Yes.

cargo fmt --all --check
cargo test -p datafusion-physical-expr utf8_two_stage_filter --lib
cargo test -p datafusion-physical-expr in_list_string_types --lib
cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
cargo test -p datafusion-physical-expr in_list_utf8_with_dict_types --lib
cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
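The comparison rule can be sketched as below. The `Sample` struct, the function names, and the choice to treat exactly 5% as noise are assumptions for illustration; the actual script used to produce the tables is not shown in this PR.

```rust
/// One Criterion-style sample: total elapsed time for `iters` iterations.
struct Sample {
    time_ns: f64,
    iters: f64,
}

/// Minimum per-iteration time over all samples: min(time / iters).
fn min_per_iter(samples: &[Sample]) -> f64 {
    samples
        .iter()
        .map(|s| s.time_ns / s.iters)
        .fold(f64::INFINITY, f64::min)
}

/// Relative change in percent; positive means the "after" run is slower.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Changes within +/-5% are treated as noise.
fn classify(before: f64, after: f64) -> Verdict {
    let pct = percent_change(before, after);
    if pct.abs() <= 5.0 {
        Verdict::Noise
    } else if pct < 0.0 {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```

For instance, `percent_change(29.89, 52.16)` is about +74.5, which matches the `utf8/short_8b/list=4/match=0%` row.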

Compared baselines: #23016 -> #23017

Relevant scope: Utf8 and nullable Utf8 rows.

Summary: 20 relevant rows, 2 faster, 6 slower, 12 within +/-5%.

Largest relevant deltas:

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| `utf8/short_8b/list=4/match=0%` | 29.89 us | 52.16 us | +74.5% (1.74x slower) |
| `utf8/short_8b/list=256/match=0%` | 29.86 us | 51.76 us | +73.3% (1.73x slower) |
| `utf8/short_8b/list=64/match=0%` | 30.38 us | 52.16 us | +71.7% (1.72x slower) |
| `nulls/utf8/short_8b/list=16/match=50%/nulls=20%` | 68.27 us | 53.79 us | -21.2% (1.27x faster) |
| `utf8/short_8b/list=4/match=50%` | 75.47 us | 67.86 us | -10.1% (1.11x faster) |
| `nulls/utf8/long_24b/list=16/match=50%/nulls=20%` | 70.00 us | 75.24 us | +7.5% (1.07x slower) |
| `utf8/short_8b/list=16/match=50%/NOT_IN` | 71.34 us | 76.47 us | +7.2% (1.07x slower) |
| `utf8/short_8b/list=64/match=50%` | 73.88 us | 77.72 us | +5.2% (1.05x slower) |

Full relevant table (20 rows)

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| `nulls/utf8/long_24b/list=16/match=50%/nulls=20%` | 70.00 us | 75.24 us | +7.5% (1.07x slower) |
| `nulls/utf8/short_8b/list=16/match=50%/nulls=20%` | 68.27 us | 53.79 us | -21.2% (1.27x faster) |
| `utf8/long_24b/list=256/match=0%` | 36.22 us | 36.02 us | -0.5% (within +/-5%) |
| `utf8/long_24b/list=256/match=50%` | 74.69 us | 73.42 us | -1.7% (within +/-5%) |
| `utf8/long_24b/list=4/match=0%` | 35.73 us | 36.02 us | +0.8% (within +/-5%) |
| `utf8/long_24b/list=4/match=50%` | 72.00 us | 73.73 us | +2.4% (within +/-5%) |
| `utf8/long_24b/list=64/match=0%` | 36.18 us | 35.77 us | -1.1% (within +/-5%) |
| `utf8/long_24b/list=64/match=50%` | 74.42 us | 74.91 us | +0.7% (within +/-5%) |
| `utf8/mixed_len/list=16/match=0%` | 47.05 us | 46.10 us | -2.0% (within +/-5%) |
| `utf8/mixed_len/list=16/match=50%` | 130.43 us | 129.44 us | -0.8% (within +/-5%) |
| `utf8/mixed_len/list=64/match=0%` | 52.34 us | 50.84 us | -2.9% (within +/-5%) |
| `utf8/mixed_len/list=64/match=50%` | 134.09 us | 139.30 us | +3.9% (within +/-5%) |
| `utf8/shared_prefix/pfx=12/list=32/match=50%` | 73.68 us | 73.50 us | -0.2% (within +/-5%) |
| `utf8/short_8b/list=16/match=50%/NOT_IN` | 71.34 us | 76.47 us | +7.2% (1.07x slower) |
| `utf8/short_8b/list=256/match=0%` | 29.86 us | 51.76 us | +73.3% (1.73x slower) |
| `utf8/short_8b/list=256/match=50%` | 73.63 us | 72.05 us | -2.1% (within +/-5%) |
| `utf8/short_8b/list=4/match=0%` | 29.89 us | 52.16 us | +74.5% (1.74x slower) |
| `utf8/short_8b/list=4/match=50%` | 75.47 us | 67.86 us | -10.1% (1.11x faster) |
| `utf8/short_8b/list=64/match=0%` | 30.38 us | 52.16 us | +71.7% (1.72x slower) |
| `utf8/short_8b/list=64/match=50%` | 73.88 us | 77.72 us | +5.2% (1.05x slower) |

Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
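A 32-byte bitmap over the `u8` value space might look like the sketch below. The type and method names are hypothetical and are not taken from the commit.

```rust
/// 256 bits, one per possible `u8` value: 4 x u64 = 32 bytes, no heap allocation.
#[derive(Clone, Copy, Default)]
struct U8Bitmap([u64; 4]);

impl U8Bitmap {
    /// Sets one bit for every value in the constant list.
    fn from_list(values: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &v in values {
            // High 2 bits select the word, low 6 bits select the bit.
            bitmap.0[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        bitmap
    }

    /// Membership test: one shift, one mask, no hashing.
    fn contains(&self, v: u8) -> bool {
        (self.0[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```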

Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
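The 8 KB figure follows from the value space: 65,536 possible `u16` values at one bit each is 8,192 bytes. A minimal sketch, with hypothetical names and a boxed slice chosen as the heap container for illustration:

```rust
/// One bit per possible `u16` value, stored on the heap.
struct U16Bitmap(Box<[u64]>);

impl U16Bitmap {
    /// 65,536 bits / 64 bits per word = 1,024 words = 8,192 bytes.
    const WORDS: usize = (u16::MAX as usize + 1) / 64;

    /// Sets one bit for every value in the constant list.
    fn from_list(values: &[u16]) -> Self {
        let mut words = vec![0u64; Self::WORDS].into_boxed_slice();
        for &v in values {
            words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self(words)
    }

    /// Membership test via shift and mask.
    fn contains(&self, v: u16) -> bool {
        (self.0[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }

    /// Size of the bitmap payload in bytes.
    fn size_bytes(&self) -> usize {
        self.0.len() * std::mem::size_of::<u64>()
    }
}
```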

Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
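The idea can be illustrated with plain slices rather than Arrow buffers. In this sketch the function names are hypothetical, and the `unsafe` cast stands in for whatever buffer-level reinterpretation the commit performs.

```rust
/// Views an `i8` slice as `u8` without copying. This is sound because `i8`
/// and `u8` have the same size and alignment and every bit pattern is valid
/// for both types.
fn reinterpret_i8_as_u8(values: &[i8]) -> &[u8] {
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len()) }
}

/// Builds a 256-bit membership bitmap from a signed list by treating each
/// value as its unsigned bit pattern.
fn build_bitmap(list: &[i8]) -> [u64; 4] {
    let mut bits = [0u64; 4];
    for &b in reinterpret_i8_as_u8(list) {
        bits[(b >> 6) as usize] |= 1u64 << (b & 63);
    }
    bits
}

/// Probes the bitmap with a signed value using the same reinterpretation.
fn contains(bits: &[u64; 4], v: i8) -> bool {
    let b = v as u8; // same bit pattern, e.g. -1 becomes 255
    (bits[(b >> 6) as usize] >> (b & 63)) & 1 == 1
}
```

Membership only depends on bit-pattern equality, so the signed interpretation never needs to be restored.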

Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
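A minimal sketch of such a comparison chain, assuming a hypothetical `contains_branchless` helper over `u32`. Whether the compiler actually unrolls the loop and emits branch-free code depends on the optimizer and target.

```rust
/// Compares `value` against all `N` list entries with no early exit.
/// `N` is a compile-time constant, so the trip count is fixed, and the
/// non-short-circuiting `|` avoids a data-dependent branch per element.
fn contains_branchless<const N: usize>(list: &[u32; N], value: u32) -> bool {
    list.iter().fold(false, |hit, &x| hit | (x == value))
}
```

The trade-off is that every lookup pays for all `N` comparisons, which is why the commit limits this path to very small lists.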

Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
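Open addressing with linear probing at a 25% load factor can be sketched as follows. The `ProbeTable` name, the `Option<u64>` slot representation, and the multiplicative hash are assumptions for the example and may differ from the commit.

```rust
/// Flat table: every slot is either empty or holds one value.
struct ProbeTable {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeTable {
    fn new(values: &[u64]) -> Self {
        // At least 4 slots per value keeps the load factor at or below 25%.
        let capacity = (values.len().max(1) * 4).next_power_of_two();
        let mut table = Self {
            slots: vec![None; capacity],
            mask: capacity - 1,
        };
        for &v in values {
            table.insert(v);
        }
        table
    }

    /// Starting slot from a multiplicative hash (an assumed hash function).
    fn start(&self, v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize & self.mask
    }

    fn insert(&mut self, v: u64) {
        let mut i = self.start(v);
        while let Some(existing) = self.slots[i] {
            if existing == v {
                return; // already present
            }
            i = (i + 1) & self.mask; // linear probing: try the next slot
        }
        self.slots[i] = Some(v);
    }

    fn contains(&self, v: u64) -> bool {
        let mut i = self.start(v);
        // The table is never full, so the probe always reaches an empty slot.
        while let Some(existing) = self.slots[i] {
            if existing == v {
                return true;
            }
            i = (i + 1) & self.mask;
        }
        false
    }
}
```

Keeping the table sparse means most probes stop at the first or second slot, and all slots sit in one contiguous allocation.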

Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
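The two stages can be sketched on plain byte strings. Arrow views are 16 bytes: a 4-byte length followed by either up to 12 inline bytes or a 4-byte prefix plus a buffer reference. This example mimics that masking by hand; the `TwoStage` type, `masked_key`, and the `HashMap` standing in for the DirectProbeFilter are hypothetical.

```rust
use std::collections::HashMap;

const INLINE_LEN: usize = 12; // short strings are kept whole in the key
const PREFIX_LEN: usize = 4; // long strings keep only this prefix

/// Stage-1 key: length plus all bytes for short strings, or length plus a
/// 4-byte prefix for long ones.
fn masked_key(s: &[u8]) -> u128 {
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    let kept = if s.len() <= INLINE_LEN { s.len() } else { PREFIX_LEN };
    buf[4..4 + kept].copy_from_slice(&s[..kept]);
    u128::from_le_bytes(buf)
}

struct TwoStage {
    /// Masked key -> full list entries sharing that key.
    candidates: HashMap<u128, Vec<Vec<u8>>>,
}

impl TwoStage {
    fn new(list: &[&str]) -> Self {
        let mut candidates: HashMap<u128, Vec<Vec<u8>>> = HashMap::new();
        for s in list {
            let bytes = s.as_bytes();
            candidates.entry(masked_key(bytes)).or_default().push(bytes.to_vec());
        }
        Self { candidates }
    }

    fn contains(&self, input: &str) -> bool {
        let bytes = input.as_bytes();
        match self.candidates.get(&masked_key(bytes)) {
            // Stage 1: no entry shares length and prefix, reject immediately.
            None => false,
            // The key is lossless for short strings, so a hit is a match.
            Some(_) if bytes.len() <= INLINE_LEN => true,
            // Stage 2: long strings need a full comparison.
            Some(full) => full.iter().any(|c| c.as_slice() == bytes),
        }
    }
}
```

Two long strings with the same length and prefix but different suffixes collide in stage 1 and are separated only by the stage-2 comparison.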

Port of the two-stage View optimization to standard Utf8 and LargeUtf8 types. Encodes strings as i128 (len + prefix) for fast O(1) pre-filtering before falling back to full string comparison. Triggers for Utf8 and LargeUtf8.

geoffreyclaude · 2026-06-18T20:56:58Z

Closing this one after the local benchmark pass. The regular Utf8 / LargeUtf8 two-stage path did not justify the extra specialization: the #23016 -> #23017 snapshot had 20 relevant rows, with 2 faster, 6 slower, and 12 within +/-5%. The miss-heavy short-Utf8 cases regressed by about 1.7x.

The stack now skips this PR and continues directly from #23016 to #23018.

geoffreyclaude added 3 commits June 18, 2026 08:30

Refactor generic InList static filter helpers

e31fafe

Build InList results from bitmaps

afc196b

Optimize generic InList static filtering

a84579d

geoffreyclaude mentioned this pull request Jun 18, 2026

IN LIST: reinterpret FixedSizeBinary for primitive fast paths #23018

Draft

github-actions bot added the `physical-expr` label (Changes to the physical-expr crates) Jun 18, 2026

geoffreyclaude mentioned this pull request Jun 18, 2026

Further improve performance of IN list evaluation #19241

Open

10 tasks

github-actions bot added the `auto detected api change` label (Auto detected API change) Jun 18, 2026

Implement Bitmap Filter for UInt8 (Stack-based)

b910c6a

geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from a19d00b to f68b456 June 18, 2026 08:26

geoffreyclaude added 4 commits June 18, 2026 10:40

Extend Bitmap Filter to UInt16 (Heap-based)

81ec379

Implement Zero-Copy Reinterpretation and enable Int8/Int16 Bitmaps

9925e82

Implement Branchless Filter for small primitive lists

eae4046

geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from f68b456 to c5f3209 June 18, 2026 09:47

github-actions bot removed the `auto detected api change` label (Auto detected API change) Jun 18, 2026

geoffreyclaude added 2 commits June 18, 2026 12:27

Implement Legacy String Optimization (Utf8TwoStageFilter)

24fa79a

geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from c5f3209 to 24fa79a June 18, 2026 10:29

geoffreyclaude mentioned this pull request Jun 18, 2026

Optimize IN performance with specialized implementations #19390

Closed

geoffreyclaude changed the title ~~Implement Legacy String Optimization (Utf8TwoStageFilter)~~ IN LIST: add two-stage filter for Utf8 and LargeUtf8 Jun 18, 2026

geoffreyclaude closed this Jun 18, 2026

geoffreyclaude wants to merge 10 commits into apache:main from geoffreyclaude:perf/in_list_utf8_two_stage_filter