Optimize interleave boolean gather and add comprehensive benchmarks#8350
Optimize interleave boolean gather and add comprehensive benchmarks#8350joseph-isaacs wants to merge 8 commits into
Conversation
…terleave Extends the interleave benchmark to measure the Vortex boolean execute path against the arrow-rs `interleave` kernel on identical data, across three access patterns (random, round_robin, single_branch), for N=2 and N=4 value arrays, nullable and non-nullable. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ecked reads Profiling and disassembly of the gather inner loop showed the per-bit cost was dominated not by the 64-bit packing (optimal for a random gather) but by the overhead around each read: a wide `&[BitBuffer]` struct index, the redundant bounds `assert` inside `BitBuffer::value`, and on the nullable path a per-row `Option`/`Mask`-variant dispatch. Pre-validate the per-row bounds once, then hoist a raw `(ptr, bit_offset)` per value buffer and read each selected bit with `get_bit_unchecked`. Materialize each branch's validity into one full-length `BitBuffer` so the validity gather is the same uniform unchecked read. Measured ~1.2x faster non-nullable and ~1.4x nullable across random / round-robin / single-branch patterns. Also adds two two-value word-boundary regression tests exercising the gather across more than one 64-bit packing word, in the non-nullable and nullable cases, validated against the existing scalar reference oracle. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Remove the arrow-rs `interleave` comparison arm and its input builder, keeping the Vortex-only benchmark across the random / round_robin / single_branch access patterns (N=2, N=4, nullable and non-nullable). No dependency changes: every arrow crate in vortex-array is used by the crate's own source. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will degrade performance by 23.61%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
20.2 µs | 35.6 µs | -43.06% |
| ❌ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] |
162 µs | 198.2 µs | -18.3% |
| ❌ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] |
177.2 µs | 213.2 µs | -16.89% |
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] |
215.3 ns | 244.4 ns | -11.93% |
| 🆕 | Simulation | vortex[random/n2/nonnull] |
N/A | 147 µs | N/A |
| 🆕 | Simulation | vortex[random/n2/null] |
N/A | 214.7 µs | N/A |
| 🆕 | Simulation | vortex[round_robin/n2/nonnull] |
N/A | 152.8 µs | N/A |
| 🆕 | Simulation | vortex[round_robin/n2/null] |
N/A | 214.3 µs | N/A |
| 🆕 | Simulation | vortex[random/n64/nonnull] |
N/A | 206.1 µs | N/A |
| 🆕 | Simulation | vortex[random/n64/null] |
N/A | 376 µs | N/A |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing claude/eager-heisenberg-fxi8wx (b582134) with develop (3d7bbfb)
Footnotes
-
4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
…ctor types Drop the `AsPrimitive<usize>` bound on the bool gather: dispatch both selectors over their concrete unsigned width via `match_each_unsigned_integer_ptype!` and convert to an index with a plain `as usize` on the native type, passed in as `branch_at` / `row_at` accessors. Keeps the selector loads at their native width. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Keep only 2-child round-robin, 2-child random, and 64-child random (each nullable and non-nullable); drop the single_branch pattern and the N=4 variants. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…> selectors Go back to the generic `gather<A: AsPrimitive<usize>, R: AsPrimitive<usize>>` with `.as_()` conversions, dropping the per-type `as usize` accessors. The trait conversion avoids the `cast_possible_truncation` lint without a scoped allow and matches the idiom used by the other bool kernels (e.g. take). Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…tern Replace `Combo.pattern: &'static str` and the `match`/`unreachable!` on string literals with a `Pattern` enum (RoundRobin, Random), making the match exhaustive and dropping the unreachable arms. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…lidity Validity is just another set of boolean buffers routed by the same selectors, so factor the hoisted-pointer unchecked gather into `gather_bits` and call it for both the values and the materialized per-branch validity, instead of duplicating the pointer-table build and `collect_bool` loop. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
This PR optimizes the boolean gather path in
InterleaveArrayand adds comprehensive benchmarks comparing Vortex's implementation against arrow-rs across multiple access patterns.Changes
Benchmark Suite (
vortex-array/benches/interleave.rs)random(fully random gather),round_robin(structured merge pattern), andsingle_branch(degenerate gather to one branch)Combostruct to organize benchmark configurations and ensure Vortex and arrow-rs test identical dataarrowbenchmark function to measure arrow-rsinterleavekernel performance alongside Vortexgen_inputs()seeded by combo to ensure byte-identical data for fair comparisonBoolean Gather Optimization (
vortex-array/src/arrays/interleave/execute/bool.rs)(byte_ptr, bit_offset)pairs per value bufferBitBuffer::value()calls with directget_bit_unchecked()to avoid redundant bounds checks and struct indexing overheadBitBufferobjects per branch to enable uniform unchecked reads instead of per-rowOption/MaskdispatchTest Coverage (
vortex-array/src/arrays/interleave/mod.rs)interleave_binary_spans_word_boundary()test for two-value gather with out-of-order routing across 64-bit packing word boundariesinterleave_binary_nullable_spans_word_boundary()test for the same scenario with nullability to exercise validity packingRationale
The boolean gather is a critical path for
InterleaveArrayexecution. The optimization targets the per-bit read overhead in random-access patterns where consecutive outputs read unrelated source words—eliminating the wide struct index and redundant bounds assertion inBitBuffer::value(). The benchmark suite enables direct performance comparison with arrow-rs and validates that the optimization maintains correctness across diverse access patterns and nullability configurations.Testing
interleave_boolandinterleave_bool_nullabletests continue to passComboconfigurationshttps://claude.ai/code/session_01FmZKDRJtXNGyuSxq5sqFtE