IN LIST: add two-stage filter for Utf8 and LargeUtf8 #23017

Closed
geoffreyclaude wants to merge 10 commits into apache:main from geoffreyclaude:perf/in_list_utf8_two_stage_filter

@geoffreyclaude commented Jun 18, 2026

Contributor

Which issue does this PR close?

Rationale for this change

#23016 optimizes Utf8View and BinaryView, where Arrow already stores length and prefix information in each view. Regular Utf8 and LargeUtf8 arrays do not have that same view layout, but short strings can still use the same basic idea.

For an all-short constant list, each string can be encoded into one fixed-size value containing:

string length + first bytes of the string

For example, the list ('cat', 'dog') can be pre-encoded once. Then each input string is encoded the same way and checked with the direct-probe lookup from #23015.

This is only used when every string in the constant list is short. If the list contains long strings, DataFusion keeps using the generic fallback path because encoding plus verification was not a win for that case.
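
The encoding described above can be sketched as follows. This is a minimal illustration, not the PR's code: the function names are hypothetical, the exact layout (a 4-byte little-endian length followed by up to 12 bytes of data, mirroring Arrow's view layout) is an assumption, `u128` is used here for simplicity, and a linear scan stands in for the direct-probe lookup from #23015.

```rust
/// Longest string that fits entirely in the encoded value.
/// Assumed layout: 4 bytes of length + 12 bytes of data (like an Arrow view).
const MAX_INLINE_LEN: usize = 12;

/// Encode a short string as `length + bytes` in a single u128.
/// Returns `None` when the string is too long to be represented exactly.
fn encode_short(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > MAX_INLINE_LEN {
        return None;
    }
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    buf[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(buf))
}

/// Pre-encode the constant list once. `None` means "not every entry is short".
fn encode_list(list: &[&str]) -> Option<Vec<u128>> {
    list.iter().map(|s| encode_short(s)).collect()
}

/// Check one input string against the pre-encoded list.
/// The linear scan is a placeholder for a real hash or direct-probe lookup.
fn contains(encoded_list: &[u128], input: &str) -> bool {
    match encode_short(input) {
        Some(key) => encoded_list.contains(&key),
        // Too long to encode, so it cannot equal any entry of an all-short list.
        None => false,
    }
}
```

Because the length is part of the encoded value, zero padding cannot make two different short strings collide: `"a"` and `"a\0"` differ in their length bytes even though their padded data bytes are identical.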

What changes are included in this PR?

  • Adds Utf8TwoStageFilter for regular Utf8 and LargeUtf8 arrays.
  • Encodes valid haystack strings as length plus prefix bytes for fast lookup.
  • Routes all-short Utf8 / LargeUtf8 constant lists to the encoded direct-probe path.
  • Keeps exact long-string verification inside the filter for correctness if that filter is used directly.
  • Keeps long-string constant lists on the generic fallback path.
  • Adds focused coverage for slices, nulls, and long strings with matching prefixes but different suffixes.
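
The routing and verification rules in the list above can be sketched as follows. This is a hedged illustration only: `Strategy`, `choose_strategy`, `TwoStage`, and the 12-byte inline limit are hypothetical names and values, and a `HashMap` stands in for the direct-probe table.

```rust
use std::collections::HashMap;

/// Assumed number of string bytes that fit in the encoded key.
const INLINE: usize = 12;

/// Stage-1 key: length plus the first (up to) 12 bytes of the string.
fn key(bytes: &[u8]) -> u128 {
    let mut buf = [0u8; 16];
    buf[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    let n = bytes.len().min(INLINE);
    buf[4..4 + n].copy_from_slice(&bytes[..n]);
    u128::from_le_bytes(buf)
}

#[derive(Debug, PartialEq)]
enum Strategy {
    EncodedDirectProbe,
    GenericFallback,
}

/// Route all-short lists to the encoded path; anything else falls back.
fn choose_strategy(list: &[&str]) -> Strategy {
    if list.iter().all(|s| s.len() <= INLINE) {
        Strategy::EncodedDirectProbe
    } else {
        Strategy::GenericFallback
    }
}

/// Two-stage filter that stays correct even when given long strings.
struct TwoStage {
    /// key -> full list entries sharing that key.
    table: HashMap<u128, Vec<String>>,
}

impl TwoStage {
    fn new(list: &[&str]) -> Self {
        let mut table: HashMap<u128, Vec<String>> = HashMap::new();
        for s in list {
            table.entry(key(s.as_bytes())).or_default().push(s.to_string());
        }
        Self { table }
    }

    fn contains(&self, s: &str) -> bool {
        match self.table.get(&key(s.as_bytes())) {
            // Stage 1: no entry with this length and prefix, reject immediately.
            None => false,
            // Short strings are fully described by the key. Long strings need
            // stage 2: an exact comparison against the candidates.
            Some(candidates) => {
                s.len() <= INLINE || candidates.iter().any(|c| c.as_str() == s)
            }
        }
    }
}
```

The stage-2 comparison is what keeps long strings with matching prefixes but different suffixes from being reported as matches.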

Are these changes tested?

Yes.

  • cargo fmt --all --check
  • cargo test -p datafusion-physical-expr utf8_two_stage_filter --lib
  • cargo test -p datafusion-physical-expr in_list_string_types --lib
  • cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
  • cargo test -p datafusion-physical-expr in_list_utf8_with_dict_types --lib
  • cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
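
The comparison method can be sketched as below. The helper names are hypothetical, the sample representation (total time, iteration count) is an assumption about how the raw Criterion samples are read, and treating exactly +/-5% as noise is also an assumption.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Minimum per-iteration time over samples of `(total_time, iters)`.
fn min_per_iter(samples: &[(f64, u64)]) -> f64 {
    samples
        .iter()
        .map(|&(time, iters)| time / iters as f64)
        .fold(f64::INFINITY, f64::min)
}

/// Relative change in percent; positive means the "after" run is slower.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Classify a before/after pair using the +/-5% noise band.
fn classify(before: f64, after: f64) -> Verdict {
    let pct = percent_change(before, after);
    if pct.abs() <= 5.0 {
        Verdict::Noise
    } else if pct < 0.0 {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```

For example, a row that moves from 29.89 us to 52.16 us has a change of about +74.5% and is classified as slower.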

Compared baselines: #23016 -> #23017

Relevant scope: Utf8 and nullable Utf8 rows.

Summary: 20 relevant rows, 2 faster, 6 slower, 12 within +/-5%.

Largest relevant deltas:

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| utf8/short_8b/list=4/match=0% | 29.89 us | 52.16 us | +74.5% (1.74x slower) |
| utf8/short_8b/list=256/match=0% | 29.86 us | 51.76 us | +73.3% (1.73x slower) |
| utf8/short_8b/list=64/match=0% | 30.38 us | 52.16 us | +71.7% (1.72x slower) |
| nulls/utf8/short_8b/list=16/match=50%/nulls=20% | 68.27 us | 53.79 us | -21.2% (1.27x faster) |
| utf8/short_8b/list=4/match=50% | 75.47 us | 67.86 us | -10.1% (1.11x faster) |
| nulls/utf8/long_24b/list=16/match=50%/nulls=20% | 70.00 us | 75.24 us | +7.5% (1.07x slower) |
| utf8/short_8b/list=16/match=50%/NOT_IN | 71.34 us | 76.47 us | +7.2% (1.07x slower) |
| utf8/short_8b/list=64/match=50% | 73.88 us | 77.72 us | +5.2% (1.05x slower) |

Full relevant table (20 rows)

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| nulls/utf8/long_24b/list=16/match=50%/nulls=20% | 70.00 us | 75.24 us | +7.5% (1.07x slower) |
| nulls/utf8/short_8b/list=16/match=50%/nulls=20% | 68.27 us | 53.79 us | -21.2% (1.27x faster) |
| utf8/long_24b/list=256/match=0% | 36.22 us | 36.02 us | -0.5% (within +/-5%) |
| utf8/long_24b/list=256/match=50% | 74.69 us | 73.42 us | -1.7% (within +/-5%) |
| utf8/long_24b/list=4/match=0% | 35.73 us | 36.02 us | +0.8% (within +/-5%) |
| utf8/long_24b/list=4/match=50% | 72.00 us | 73.73 us | +2.4% (within +/-5%) |
| utf8/long_24b/list=64/match=0% | 36.18 us | 35.77 us | -1.1% (within +/-5%) |
| utf8/long_24b/list=64/match=50% | 74.42 us | 74.91 us | +0.7% (within +/-5%) |
| utf8/mixed_len/list=16/match=0% | 47.05 us | 46.10 us | -2.0% (within +/-5%) |
| utf8/mixed_len/list=16/match=50% | 130.43 us | 129.44 us | -0.8% (within +/-5%) |
| utf8/mixed_len/list=64/match=0% | 52.34 us | 50.84 us | -2.9% (within +/-5%) |
| utf8/mixed_len/list=64/match=50% | 134.09 us | 139.30 us | +3.9% (within +/-5%) |
| utf8/shared_prefix/pfx=12/list=32/match=50% | 73.68 us | 73.50 us | -0.2% (within +/-5%) |
| utf8/short_8b/list=16/match=50%/NOT_IN | 71.34 us | 76.47 us | +7.2% (1.07x slower) |
| utf8/short_8b/list=256/match=0% | 29.86 us | 51.76 us | +73.3% (1.73x slower) |
| utf8/short_8b/list=256/match=50% | 73.63 us | 72.05 us | -2.1% (within +/-5%) |
| utf8/short_8b/list=4/match=0% | 29.89 us | 52.16 us | +74.5% (1.74x slower) |
| utf8/short_8b/list=4/match=50% | 75.47 us | 67.86 us | -10.1% (1.11x faster) |
| utf8/short_8b/list=64/match=0% | 30.38 us | 52.16 us | +71.7% (1.72x slower) |
| utf8/short_8b/list=64/match=50% | 73.88 us | 77.72 us | +5.2% (1.05x slower) |

Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
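
A 32-byte bitmap of this kind could look like the sketch below. The type name `U8Bitmap` and its methods are hypothetical and are not taken from the PR's code.

```rust
/// 256-bit membership bitmap for u8 values: 4 x u64 = 32 bytes, no heap use.
struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Set one bit per value in the constant list.
    fn new(values: &[u8]) -> Self {
        let mut words = [0u64; 4];
        for &v in values {
            // High 2 bits pick the word, low 6 bits pick the bit inside it.
            words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { words }
    }

    /// O(1) membership test using only shifts and a mask.
    #[inline]
    fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

The same idea extends to UInt16 with 65,536 bits, which is where an 8 KB bitmap comes from.
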
@geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from a19d00b to f68b456 on June 18, 2026 08:26
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
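
A const-generic comparison chain along these lines might be sketched as below. The function name is hypothetical, and whether the compiler actually emits branch-free code depends on the target and optimization level.

```rust
/// Compare `needle` against all `N` list values with no early exit.
/// `|=` (rather than a short-circuiting `||`) keeps every comparison
/// unconditional, which lets the compiler unroll the fixed-length loop.
#[inline]
fn contains_branchless<T: Copy + PartialEq, const N: usize>(list: &[T; N], needle: T) -> bool {
    let mut found = false;
    for v in list {
        found |= *v == needle;
    }
    found
}
```

Because `N` is known at compile time, each list size gets its own specialized copy of the function, for example `contains_branchless(&[1i32, 5, 9, 42], 9)`.
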
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
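
Open addressing with linear probing at a 25% load factor can be sketched as follows. `ProbeSet`, the `u64` key type, and the multiplicative hash constant are illustrative assumptions, not the PR's implementation.

```rust
/// Open-addressing set with linear probing, sized for a 25% load factor.
struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    fn new(values: &[u64]) -> Self {
        // At least 4 slots per value, rounded up to a power of two so that
        // wrapping the probe index is a single bitwise AND.
        let cap = (values.len().max(1) * 4).next_power_of_two();
        let mut set = Self { slots: vec![None; cap], mask: cap - 1 };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Multiplicative hash (illustrative choice) reduced to a slot index.
    fn slot(&self, v: u64) -> usize {
        (v.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize & self.mask
    }

    fn insert(&mut self, v: u64) {
        let mut i = self.slot(v);
        loop {
            let current = self.slots[i];
            match current {
                Some(x) if x == v => return, // already present
                None => {
                    self.slots[i] = Some(v);
                    return;
                }
                _ => i = (i + 1) & self.mask, // occupied, try the next slot
            }
        }
    }

    fn contains(&self, v: u64) -> bool {
        let mut i = self.slot(v);
        loop {
            let current = self.slots[i];
            match current {
                Some(x) if x == v => return true,
                None => return false, // an empty slot ends the probe sequence
                _ => i = (i + 1) & self.mask,
            }
        }
    }
}
```

With at most a quarter of the slots occupied, the table always contains empty slots, so every probe sequence terminates and is expected to be short.
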
@geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from f68b456 to c5f3209 on June 18, 2026 09:47
@github-actions (bot) removed the "auto detected api change" label on Jun 18, 2026
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
Port of the two-stage View optimization to standard Utf8 and LargeUtf8 types. Encodes strings as i128 (len + prefix) for fast O(1) pre-filtering before falling back to full string comparison. Triggers for Utf8 and LargeUtf8.
@geoffreyclaude force-pushed the perf/in_list_utf8_two_stage_filter branch from c5f3209 to 24fa79a on June 18, 2026 10:29
@geoffreyclaude changed the title from "Implement Legacy String Optimization (Utf8TwoStageFilter)" to "IN LIST: add two-stage filter for Utf8 and LargeUtf8" on Jun 18, 2026
@geoffreyclaude

Contributor Author

Closing this one after the local benchmark pass. The regular Utf8 / LargeUtf8 two-stage path did not justify the extra specialization: the #23016 -> #23017 snapshot had 20 relevant rows, with 2 faster, 6 slower, and 12 within +/-5%. The miss-heavy short-Utf8 cases regressed by about 1.7x.

The stack now skips this PR and continues directly from #23016 to #23018.

Labels

physical-expr Changes to the physical-expr crates
