IN LIST: add two-stage filter for Utf8 and LargeUtf8 #23017
Closed
geoffreyclaude wants to merge 10 commits into
Replaces HashSet<u8> with a 32-byte stack-allocated bitmap. Provides O(1) membership testing via bit-shifting, significantly reducing memory overhead and improving cache locality. Triggers for UInt8 arrays.
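A minimal sketch of the 32-byte bitmap idea described above, using only the standard library. The type name `U8Bitmap` and its methods are hypothetical and are not taken from the PR; the actual DataFusion implementation may differ.

```rust
/// Hypothetical 256-bit membership bitmap for `u8` values.
/// Four `u64` words stored inline, so the whole set is 32 bytes on the stack.
#[derive(Clone, Copy, Default)]
pub struct U8Bitmap {
    words: [u64; 4],
}

impl U8Bitmap {
    /// Sets one bit per value in the constant list.
    pub fn from_values(values: &[u8]) -> Self {
        let mut bitmap = Self::default();
        for &v in values {
            // High 2 bits pick the word, low 6 bits pick the bit inside it.
            bitmap.words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        bitmap
    }

    /// O(1) membership test: one index, one shift, one mask.
    #[inline]
    pub fn contains(&self, v: u8) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }
}
```

Because every possible `u8` has its own bit, the lookup needs no hashing and no collision handling.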
Implements an 8 KB heap-allocated bitmap for UInt16. Maintains O(1) performance while handling the larger value space. Triggers for UInt16 arrays.
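The same idea scales to `u16` with 65,536 bits, which is 1,024 `u64` words or 8 KB. The sketch below is illustrative only; `U16Bitmap` and `heap_bytes` are assumed names, not identifiers from the PR.

```rust
/// Hypothetical heap-allocated bitmap covering every `u16` value.
pub struct U16Bitmap {
    words: Box<[u64]>,
}

impl U16Bitmap {
    /// 65_536 bits / 64 bits per word = 1_024 words = 8 KB.
    const WORDS: usize = (u16::MAX as usize + 1) / 64;

    pub fn from_values(values: &[u16]) -> Self {
        let mut words = vec![0u64; Self::WORDS].into_boxed_slice();
        for &v in values {
            // High 10 bits pick the word, low 6 bits pick the bit inside it.
            words[(v >> 6) as usize] |= 1u64 << (v & 63);
        }
        Self { words }
    }

    #[inline]
    pub fn contains(&self, v: u16) -> bool {
        (self.words[(v >> 6) as usize] >> (v & 63)) & 1 == 1
    }

    /// Size of the heap allocation in bytes.
    pub fn heap_bytes(&self) -> usize {
        self.words.len() * std::mem::size_of::<u64>()
    }
}
```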
Introduces zero-copy buffer reinterpretation to allow signed integers and other 1 or 2-byte primitive types (e.g. Float16) to use the high-performance bitmap filters. Triggers for all types with 1-byte or 2-byte width.
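As a rough illustration of the reinterpretation step, the sketch below views signed slices as unsigned ones without copying, then probes a 256-entry table. It uses plain slices rather than Arrow buffers, and the function names are hypothetical; how the PR performs the reinterpretation on Arrow arrays is not shown here.

```rust
/// Views an `i8` slice as `u8` without copying.
pub fn i8_as_u8(values: &[i8]) -> &[u8] {
    // SAFETY: `i8` and `u8` have the same size and alignment, and every
    // bit pattern is valid for both types.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len()) }
}

/// Views an `i16` slice as `u16` without copying.
pub fn i16_as_u16(values: &[i16]) -> &[u16] {
    // SAFETY: same size, same alignment, all bit patterns valid.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u16>(), values.len()) }
}

/// Membership test for signed bytes that reuses an unsigned lookup table.
pub fn contains_i8(list: &[i8], probes: &[i8]) -> Vec<bool> {
    let mut table = [false; 256];
    for &bits in i8_as_u8(list) {
        table[bits as usize] = true;
    }
    i8_as_u8(probes).iter().map(|&bits| table[bits as usize]).collect()
}
```

This compares bit patterns, so for a floating-point type it is bitwise equality rather than numeric equality.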
Adds a const-generic unrolled comparison chain that avoids CPU branching. Outperforms hash lookups for very small lists. Triggers for primitives when list size <= 32 (4-byte), 16 (8-byte), or 4 (16-byte).
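One way to picture the const-generic comparison chain is the sketch below. `BranchlessFilter` and `branchless_threshold` are hypothetical names; the thresholds are the ones quoted in the description above. Whether the compiler fully unrolls the loop depends on `N` and optimization settings.

```rust
/// Hypothetical fixed-size filter: compares the needle against every list
/// value and ORs the results, with no early exit.
pub struct BranchlessFilter<const N: usize> {
    values: [u32; N],
}

impl<const N: usize> BranchlessFilter<N> {
    pub fn new(values: [u32; N]) -> Self {
        Self { values }
    }

    #[inline]
    pub fn contains(&self, needle: u32) -> bool {
        // Non-short-circuiting `|` keeps the chain free of data-dependent
        // branches; `N` is a compile-time constant, so it can be unrolled.
        self.values
            .iter()
            .fold(false, |found, &v| found | (v == needle))
    }
}

/// Largest list size that uses the comparison chain, per value width.
pub const fn branchless_threshold(byte_width: usize) -> Option<usize> {
    match byte_width {
        4 => Some(32),
        8 => Some(16),
        16 => Some(4),
        _ => None,
    }
}
```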
Implements a fast hash table using open addressing with linear probing and a 25% load factor. Replaces the legacy HashSet for primitives, reducing indirection. Triggers for primitives when list size exceeds branchless thresholds.
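The following is a small sketch of open addressing with linear probing at a load factor of at most 25%. The name `ProbeSet`, the multiplicative hash constant, and the `Option<u64>` slot representation are assumptions made for illustration, not details from the PR.

```rust
/// Hypothetical open-addressing set for `u64` keys.
pub struct ProbeSet {
    slots: Vec<Option<u64>>,
    mask: usize,
}

impl ProbeSet {
    pub fn new(values: &[u64]) -> Self {
        // At least 4 slots per value keeps the load factor at or below 25%.
        let capacity = (values.len() * 4).max(4).next_power_of_two();
        let mut set = Self {
            slots: vec![None; capacity],
            mask: capacity - 1,
        };
        for &v in values {
            set.insert(v);
        }
        set
    }

    /// Multiplicative hash, reduced to a slot index with the power-of-two mask.
    fn slot_for(&self, key: u64) -> usize {
        (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize & self.mask
    }

    fn insert(&mut self, key: u64) {
        let mut i = self.slot_for(key);
        loop {
            match self.slots[i] {
                None => {
                    self.slots[i] = Some(key);
                    return;
                }
                Some(existing) if existing == key => return,
                // Linear probing: try the next slot, wrapping around.
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }

    pub fn contains(&self, key: u64) -> bool {
        let mut i = self.slot_for(key);
        loop {
            match self.slots[i] {
                // An empty slot ends the probe sequence: the key is absent.
                None => return false,
                Some(existing) if existing == key => return true,
                Some(_) => i = (i + 1) & self.mask,
            }
        }
    }

    pub fn capacity(&self) -> usize {
        self.slots.len()
    }
}
```

The low load factor trades memory for short probe sequences, and it guarantees that a lookup always reaches an empty slot.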
Introduces a two-stage filter for ByteView types. Stage 1 uses a fast DirectProbeFilter on masked views (len + prefix) for quick rejection; Stage 2 performs full verification only for potential long-string matches. Triggers for Utf8View and BinaryView.
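A rough sketch of the two stages, assuming the Arrow view layout (a 4-byte length followed by either up to 12 inline bytes or a 4-byte prefix plus buffer index and offset). A standard `HashMap` stands in for the direct-probe filter, and all names here are hypothetical.

```rust
use std::collections::HashMap;

/// Strings up to this length are stored entirely inside a 16-byte view.
const INLINE_LEN: usize = 12;

/// Keeps only len + prefix of a raw view: short views are unchanged, long
/// views lose their buffer index and offset (the upper 8 bytes).
pub fn mask_view(view: u128) -> u128 {
    let len = view as u32 as usize; // low 4 bytes hold the length
    if len <= INLINE_LEN { view } else { view & u128::from(u64::MAX) }
}

/// Builds the masked key directly from string bytes.
pub fn masked_key(s: &[u8]) -> u128 {
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    let kept = if s.len() <= INLINE_LEN { s.len() } else { 4 };
    bytes[4..4 + kept].copy_from_slice(&s[..kept]);
    u128::from_le_bytes(bytes)
}

pub struct ViewTwoStage {
    /// Masked key -> long list entries sharing that key (empty for short ones).
    candidates: HashMap<u128, Vec<Vec<u8>>>,
}

impl ViewTwoStage {
    pub fn new(list: &[&str]) -> Self {
        let mut candidates: HashMap<u128, Vec<Vec<u8>>> = HashMap::new();
        for s in list {
            let entry = candidates.entry(masked_key(s.as_bytes())).or_default();
            if s.len() > INLINE_LEN {
                entry.push(s.as_bytes().to_vec());
            }
        }
        Self { candidates }
    }

    pub fn contains(&self, s: &str) -> bool {
        let bytes = s.as_bytes();
        match self.candidates.get(&masked_key(bytes)) {
            // Stage 1: key not present, reject without reading the full string.
            None => false,
            // Short strings are fully described by the key.
            Some(_) if bytes.len() <= INLINE_LEN => true,
            // Stage 2: long strings need a full comparison.
            Some(longs) => longs.iter().any(|c| c.as_slice() == bytes),
        }
    }
}
```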
Port of the two-stage View optimization to standard Utf8 and LargeUtf8 types. Encodes strings as i128 (len + prefix) for fast O(1) pre-filtering before falling back to full string comparison. Triggers for Utf8 and LargeUtf8.
Closing this one after the local benchmark pass. The stack now skips this PR and continues directly from #23016 to #23018.
Which issue does this PR close?
IN performance with specialized implementations #19390.
Rationale for this change
#23016 optimizes Utf8View and BinaryView, where Arrow already stores length and prefix information in each view. Regular Utf8 and LargeUtf8 arrays do not have that view layout, but short strings can still use the same basic idea.
For an all-short constant list, each string can be encoded into one fixed-size value containing:
- the string length
- the string's prefix bytes
For example, the list ('cat', 'dog') can be pre-encoded once. Then each input string is encoded the same way and checked with the direct-probe lookup from #23015.
This is only used when every string in the constant list is short. If the list contains long strings, DataFusion keeps using the generic fallback path because encoding plus verification was not a win for that case.
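The encoding described above could look roughly like the sketch below. The 12-byte cutoff for "short" mirrors the view layout and is an assumption, as are the names `encode_short` and `ShortUtf8Filter`; a standard `HashSet` stands in for the direct-probe lookup from #23015.

```rust
use std::collections::HashSet;

/// Assumed cutoff for "short": 4 bytes of length plus 12 bytes of data
/// fill one 16-byte value. The PR's real threshold may differ.
const MAX_SHORT_LEN: usize = 12;

/// Packs length and bytes into one `i128`, or returns `None` for long strings.
pub fn encode_short(s: &str) -> Option<i128> {
    let b = s.as_bytes();
    if b.len() > MAX_SHORT_LEN {
        return None;
    }
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(b.len() as u32).to_le_bytes());
    bytes[4..4 + b.len()].copy_from_slice(b);
    Some(i128::from_le_bytes(bytes))
}

pub struct ShortUtf8Filter {
    encoded: HashSet<i128>,
}

impl ShortUtf8Filter {
    /// Pre-encodes the constant list once. Returns `None` if any entry is
    /// long, so the caller can keep using the generic fallback path.
    pub fn try_new(list: &[&str]) -> Option<Self> {
        let encoded = list
            .iter()
            .map(|s| encode_short(s))
            .collect::<Option<HashSet<i128>>>()?;
        Some(Self { encoded })
    }

    pub fn contains(&self, s: &str) -> bool {
        match encode_short(s) {
            Some(key) => self.encoded.contains(&key),
            // A long input cannot equal any entry of an all-short list.
            None => false,
        }
    }
}
```

Including the length in the encoded value keeps strings such as "a" and "a\0" distinct even though both are zero-padded.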
What changes are included in this PR?
- Utf8TwoStageFilter for regular Utf8 and LargeUtf8 arrays.
- Routing of Utf8/LargeUtf8 constant lists to the encoded direct-probe path.
Are these changes tested?
Yes.
- cargo fmt --all --check
- cargo test -p datafusion-physical-expr utf8_two_stage_filter --lib
- cargo test -p datafusion-physical-expr in_list_string_types --lib
- cargo test -p datafusion-physical-expr test_in_list_from_array_type_combinations --lib
- cargo test -p datafusion-physical-expr in_list_utf8_with_dict_types --lib
- cargo clippy -p datafusion-physical-expr --all-targets --all-features -- -D warnings
Are there any user-facing changes?
No. This is an internal performance optimization only.
Local benchmark snapshot
Benchmark command:
Method: compare adjacent saved baselines using raw Criterion sample minima (min(time / iters)). Lower is better; changes within +/-5% are treated as noise.
Compared baselines: #23016 -> #23017
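The comparison method can be sketched as follows. This is a standalone illustration with hypothetical names; it does not read Criterion's saved baseline files, and the sample layout of `(total_time, iterations)` pairs is an assumption.

```rust
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Faster,
    Slower,
    Noise,
}

/// Smallest per-iteration time across samples, i.e. min(time / iters).
/// Each sample is (total time, iteration count); empty input gives `None`.
pub fn min_time_per_iter(samples: &[(f64, u64)]) -> Option<f64> {
    samples
        .iter()
        .filter(|&&(_, iters)| iters > 0)
        .map(|&(time, iters)| time / iters as f64)
        .reduce(f64::min)
}

/// Classifies the relative change from `base` to `new`; anything within
/// +/-5% counts as noise, and lower times are better.
pub fn classify(base: f64, new: f64) -> Verdict {
    let change = (new - base) / base;
    if change.abs() <= 0.05 {
        Verdict::Noise
    } else if change < 0.0 {
        Verdict::Faster
    } else {
        Verdict::Slower
    }
}
```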
Relevant scope: Utf8 and nullable Utf8 rows.
Summary: 20 relevant rows, 2 faster, 6 slower, 12 within +/-5%.
Largest relevant deltas:
- utf8/short_8b/list=4/match=0%
- utf8/short_8b/list=256/match=0%
- utf8/short_8b/list=64/match=0%
- nulls/utf8/short_8b/list=16/match=50%/nulls=20%
- utf8/short_8b/list=4/match=50%
- nulls/utf8/long_24b/list=16/match=50%/nulls=20%
- utf8/short_8b/list=16/match=50%/NOT_IN
- utf8/short_8b/list=64/match=50%
Full relevant table (20 rows)
- nulls/utf8/long_24b/list=16/match=50%/nulls=20%
- nulls/utf8/short_8b/list=16/match=50%/nulls=20%
- utf8/long_24b/list=256/match=0%
- utf8/long_24b/list=256/match=50%
- utf8/long_24b/list=4/match=0%
- utf8/long_24b/list=4/match=50%
- utf8/long_24b/list=64/match=0%
- utf8/long_24b/list=64/match=50%
- utf8/mixed_len/list=16/match=0%
- utf8/mixed_len/list=16/match=50%
- utf8/mixed_len/list=64/match=0%
- utf8/mixed_len/list=64/match=50%
- utf8/shared_prefix/pfx=12/list=32/match=50%
- utf8/short_8b/list=16/match=50%/NOT_IN
- utf8/short_8b/list=256/match=0%
- utf8/short_8b/list=256/match=50%
- utf8/short_8b/list=4/match=0%
- utf8/short_8b/list=4/match=50%
- utf8/short_8b/list=64/match=0%
- utf8/short_8b/list=64/match=50%