IN LIST: add branchless filter for small primitive lists #23014

Draft
geoffreyclaude wants to merge 8 commits into apache:main from geoffreyclaude:perf/in_list_branchless_filter

@geoffreyclaude geoffreyclaude commented Jun 18, 2026

Contributor

Which issue does this PR close?

Rationale for this change

For very small IN lists, building or probing a hash table can be more work than just comparing the input value with each constant.

For example, for x IN (10, 20, 30), the fast path can behave like:

x == 10 OR x == 20 OR x == 30

Because the list is tiny, those comparisons are cheap. The implementation stores the constants in a fixed-size array and checks them with a compact comparison chain.

“Branchless” here means the comparisons are combined without stopping at the first match. That can be faster for these small fixed-width lists because the CPU gets a predictable sequence of simple operations instead of hash-table setup and probe logic.
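The idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the struct layout and the `contains` and `evaluate` method names are assumptions made for the example, and the real filter also has to handle Arrow arrays and nulls.

```rust
/// Hypothetical sketch of a const-generic filter: the N constants of the
/// IN list live in a fixed-size array, so the loop bound is known at
/// compile time and the compiler is free to unroll it.
struct BranchlessFilter<T, const N: usize> {
    values: [T; N],
}

impl<T: Copy + PartialEq, const N: usize> BranchlessFilter<T, N> {
    fn new(values: [T; N]) -> Self {
        Self { values }
    }

    /// Compares `x` with every constant and combines the results with the
    /// non-short-circuiting `|=`, so there is no early exit on a match.
    #[inline]
    fn contains(&self, x: T) -> bool {
        let mut hit = false;
        for v in self.values.iter() {
            hit |= x == *v;
        }
        hit
    }

    /// Applies `contains` to every input value (nulls are ignored here).
    fn evaluate(&self, input: &[T]) -> Vec<bool> {
        input.iter().map(|&x| self.contains(x)).collect()
    }
}

/// Usage: `x IN (10, 20, 30)` over a small batch of values.
fn demo() -> Vec<bool> {
    let filter = BranchlessFilter::new([10i32, 20, 30]);
    filter.evaluate(&[10, 15, 30, 40])
}
```

Whether the compiler actually emits branch-free or vectorized code for such a loop depends on the type, the target, and the optimization level, so the benchmarks remain the real evidence.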

Larger primitive lists are left for #23015, where a purpose-built hash lookup becomes the better tradeoff.

What changes are included in this PR?

  • Adds a const-generic BranchlessFilter for small primitive IN lists.
  • Adds thresholds for when this path is used:
    • up to 32 values for 4-byte types
    • up to 16 values for 8-byte types
    • up to 4 values for 16-byte types
  • Reuses the zero-copy reinterpretation approach from #23013 (IN LIST: reinterpret small-width types for bitmap filters) for compatible same-width primitive types.
  • Keeps the same IN / NOT IN null behavior as the rest of the stack.
  • Adds focused coverage for sliced reinterpreted arrays.
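As a rough sketch of how the thresholds listed above could be expressed, assuming selection is keyed on the byte width of the primitive type: the function names `branchless_max_len` and `use_branchless` are hypothetical and are not taken from the PR.

```rust
use std::mem::size_of;

/// Hypothetical helper: the largest IN-list length that takes the
/// branchless path for a primitive that is `width` bytes wide, or `None`
/// when this path does not cover that width.
fn branchless_max_len(width: usize) -> Option<usize> {
    match width {
        4 => Some(32),  // e.g. i32, u32, f32
        8 => Some(16),  // e.g. i64, f64, nanosecond timestamps
        16 => Some(4),  // e.g. i128-backed decimals
        _ => None,      // other widths are handled elsewhere
    }
}

/// Hypothetical predicate: should a list of `list_len` constants of type
/// `T` use the branchless filter?
fn use_branchless<T>(list_len: usize) -> bool {
    match branchless_max_len(size_of::<T>()) {
        Some(max) => list_len <= max,
        None => false,
    }
}
```

Under these assumptions, `use_branchless::<i32>(32)` is true while `use_branchless::<i32>(33)` is false, which corresponds to handing the larger list to a different strategy.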

Are these changes tested?

Yes.

  • cargo fmt --all --check
  • cargo test -p datafusion-physical-expr reinterpreted_ --lib
  • cargo test -p datafusion-physical-expr in_list_int_types --lib
  • cargo test -p datafusion-physical-expr in_list_float64 --lib
  • cargo test -p datafusion-physical-expr in_list_decimal --lib
  • cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
  • cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings

Are there any user-facing changes?

No. This is an internal performance optimization only.

Local benchmark snapshot

Benchmark command:

cargo bench -p datafusion-physical-expr --profile release-nonlto --bench in_list_strategy -- --save-baseline <name>

Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
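For readers who want to reproduce the Change column, the arithmetic can be sketched as below. This is an illustration of the stated method only; the function names are hypothetical and the actual comparison script is not part of this description.

```rust
/// Smallest per-iteration time over raw samples given as
/// (total_time, iterations) pairs, i.e. min(time / iters).
fn sample_min(samples: &[(f64, u64)]) -> f64 {
    samples
        .iter()
        .map(|&(time, iters)| time / iters as f64)
        .fold(f64::INFINITY, f64::min)
}

/// Relative change in percent; negative means the new baseline is faster.
fn change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Speedup factor, e.g. 2.0 means the new baseline takes half the time.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Treats changes within +/-5% as noise, as described in the method.
fn classify(before: f64, after: f64) -> Verdict {
    let pct = change_pct(before, after);
    if pct < -5.0 {
        Verdict::Faster
    } else if pct > 5.0 {
        Verdict::Slower
    } else {
        Verdict::Noise
    }
}
```

For example, a row going from 47.55 us to 3.25 us gives a change of about -93.2% and a speedup of about 14.63x.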

Compared baselines: #23013 -> #23014

Relevant scope: small primitive-list rows.

Summary: 20 relevant rows, 20 faster, 0 slower, 0 within +/-5%.

Largest relevant deltas:

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| timestamp_ns/small_list/list=4/match=50% | 47.55 us | 3.25 us | -93.2% (14.63x faster) |
| f32/small_list/list=4/match=50% | 33.79 us | 3.04 us | -91.0% (11.12x faster) |
| primitive/i32/small_list/list=4/match=50% | 31.84 us | 3.04 us | -90.5% (10.48x faster) |
| primitive/i64/small_list/list=4/match=50% | 28.43 us | 3.20 us | -88.7% (8.87x faster) |
| f32/small_list/list=4/match=0% | 18.97 us | 3.00 us | -84.2% (6.32x faster) |
| timestamp_ns/small_list/list=4/match=0% | 19.48 us | 3.27 us | -83.2% (5.96x faster) |
| primitive/i32/small_list/list=4/match=0% | 17.19 us | 3.03 us | -82.4% (5.67x faster) |
| primitive/i64/small_list/list=4/match=0% | 16.63 us | 3.21 us | -80.7% (5.18x faster) |
| primitive/i32/small_list/list=16/match=50%/NOT_IN | 33.19 us | 7.29 us | -78.0% (4.56x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=20% | 30.33 us | 7.19 us | -76.3% (4.22x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=20%/NOT_IN | 30.02 us | 7.30 us | -75.7% (4.11x faster) |
| timestamp_ns/small_list/list=16/match=50% | 42.51 us | 11.81 us | -72.2% (3.60x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=50% | 21.36 us | 7.19 us | -66.4% (2.97x faster) |
| f32/small_list/list=32/match=50% | 38.89 us | 13.20 us | -66.1% (2.95x faster) |
| primitive/i64/small_list/list=16/match=50% | 30.68 us | 12.12 us | -60.5% (2.53x faster) |

Full relevant table (20 rows)

| Benchmark | Before | After | Change |
| --- | --- | --- | --- |
| f32/small_list/list=32/match=0% | 21.56 us | 13.33 us | -38.2% (1.62x faster) |
| f32/small_list/list=32/match=50% | 38.89 us | 13.20 us | -66.1% (2.95x faster) |
| f32/small_list/list=4/match=0% | 18.97 us | 3.00 us | -84.2% (6.32x faster) |
| f32/small_list/list=4/match=50% | 33.79 us | 3.04 us | -91.0% (11.12x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=20% | 30.33 us | 7.19 us | -76.3% (4.22x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=20%/NOT_IN | 30.02 us | 7.30 us | -75.7% (4.11x faster) |
| nulls/primitive/i32/small_list/list=16/match=50%/nulls=50% | 21.36 us | 7.19 us | -66.4% (2.97x faster) |
| primitive/i32/small_list/list=16/match=50%/NOT_IN | 33.19 us | 7.29 us | -78.0% (4.56x faster) |
| primitive/i32/small_list/list=32/match=0% | 17.11 us | 13.33 us | -22.1% (1.28x faster) |
| primitive/i32/small_list/list=32/match=50% | 29.63 us | 13.34 us | -55.0% (2.22x faster) |
| primitive/i32/small_list/list=4/match=0% | 17.19 us | 3.03 us | -82.4% (5.67x faster) |
| primitive/i32/small_list/list=4/match=50% | 31.84 us | 3.04 us | -90.5% (10.48x faster) |
| primitive/i64/small_list/list=16/match=0% | 17.16 us | 11.91 us | -30.6% (1.44x faster) |
| primitive/i64/small_list/list=16/match=50% | 30.68 us | 12.12 us | -60.5% (2.53x faster) |
| primitive/i64/small_list/list=4/match=0% | 16.63 us | 3.21 us | -80.7% (5.18x faster) |
| primitive/i64/small_list/list=4/match=50% | 28.43 us | 3.20 us | -88.7% (8.87x faster) |
| timestamp_ns/small_list/list=16/match=0% | 19.59 us | 11.79 us | -39.8% (1.66x faster) |
| timestamp_ns/small_list/list=16/match=50% | 42.51 us | 11.81 us | -72.2% (3.60x faster) |
| timestamp_ns/small_list/list=4/match=0% | 19.48 us | 3.27 us | -83.2% (5.96x faster) |
| timestamp_ns/small_list/list=4/match=50% | 47.55 us | 3.25 us | -93.2% (14.63x faster) |

@github-actions github-actions Bot added the physical-expr Changes to the physical-expr crates label Jun 18, 2026
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_branchless_filter branch 3 times, most recently from 428e3cd to eae4046 Compare June 18, 2026 09:03
@geoffreyclaude geoffreyclaude changed the title Implement Branchless Filter for small primitive lists IN LIST: add branchless filter for small primitive lists Jun 18, 2026
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_branchless_filter branch from eae4046 to 3e3651b Compare June 19, 2026 05:11
@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_branchless_filter branch from 3e3651b to 6a1869f Compare June 19, 2026 05:35
@github-actions github-actions Bot removed the auto detected api change Auto detected API change label Jun 19, 2026
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
Introduces zero-copy buffer reinterpretation so that signed integers and other 1-byte or 2-byte primitive types (e.g. Float16) can use the bitmap filters. Triggers for all types with 1-byte or 2-byte width.
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
@geoffreyclaude geoffreyclaude force-pushed the perf/in_list_branchless_filter branch from 6a1869f to d57ed3f Compare June 19, 2026 05:55