Optimize interleave boolean gather and add comprehensive benchmarks by joseph-isaacs · Pull Request #8350 · vortex-data/vortex

joseph-isaacs · 2026-06-11T09:03:29Z

Summary

This PR optimizes the boolean gather path in InterleaveArray and adds comprehensive benchmarks comparing Vortex's implementation against arrow-rs across multiple access patterns.

Changes

Benchmark Suite (vortex-array/benches/interleave.rs)

Restructured benchmarks to test three distinct access patterns: random (fully random gather), round_robin (structured merge pattern), and single_branch (degenerate gather to one branch)
Added parametrization across branch counts (N=2, N=4) and nullability (nullable/non-nullable)
Introduced Combo struct to organize benchmark configurations and ensure Vortex and arrow-rs test identical data
Added arrow benchmark function to measure arrow-rs interleave kernel performance alongside Vortex
Unified input generation via gen_inputs() seeded by combo to ensure byte-identical data for fair comparison

Boolean Gather Optimization (vortex-array/src/arrays/interleave/execute/bool.rs)

Optimized the random-access gather loop by hoisting raw (byte_ptr, bit_offset) pairs per value buffer
Replaced BitBuffer::value() calls with direct get_bit_unchecked() to avoid redundant bounds checks and struct indexing overhead
Materialized validity masks into full-length BitBuffer objects per branch to enable uniform unchecked reads instead of per-row Option/Mask dispatch
Added detailed comments explaining the optimization rationale and safety invariants
Maintained correctness by validating all bounds once upfront before the tight unchecked loops

Test Coverage (vortex-array/src/arrays/interleave/mod.rs)

Added interleave_binary_spans_word_boundary() test for two-value gather with out-of-order routing across 64-bit packing word boundaries
Added interleave_binary_nullable_spans_word_boundary() test for the same scenario with nullability to exercise validity packing

Rationale

The boolean gather is a critical path for InterleaveArray execution. The optimization targets the per-bit read overhead in random-access patterns where consecutive outputs read unrelated source words—eliminating the wide struct index and redundant bounds assertion in BitBuffer::value(). The benchmark suite enables direct performance comparison with arrow-rs and validates that the optimization maintains correctness across diverse access patterns and nullability configurations.

Testing

Existing interleave_bool and interleave_bool_nullable tests continue to pass
Added two new unit tests covering word-boundary spanning scenarios with and without nullability
Benchmark suite validates correctness by comparing Vortex and arrow-rs outputs across all Combo configurations
All changes are covered by existing and new unit tests; no manual testing required

https://claude.ai/code/session_01FmZKDRJtXNGyuSxq5sqFtE

…terleave Extends the interleave benchmark to measure the Vortex boolean execute path against the arrow-rs `interleave` kernel on identical data, across three access patterns (random, round_robin, single_branch), for N=2 and N=4 value arrays, nullable and non-nullable. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…ecked reads Profiling and disassembly of the gather inner loop showed the per-bit cost was dominated not by the 64-bit packing (optimal for a random gather) but by the overhead around each read: a wide `&[BitBuffer]` struct index, the redundant bounds `assert` inside `BitBuffer::value`, and on the nullable path a per-row `Option`/`Mask`-variant dispatch. Pre-validate the per-row bounds once, then hoist a raw `(ptr, bit_offset)` per value buffer and read each selected bit with `get_bit_unchecked`. Materialize each branch's validity into one full-length `BitBuffer` so the validity gather is the same uniform unchecked read. Measured ~1.2x faster non-nullable and ~1.4x nullable across random / round-robin / single-branch patterns. Also adds two two-value word-boundary regression tests exercising the gather across more than one 64-bit packing word, in the non-nullable and nullable cases, validated against the existing scalar reference oracle. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Remove the arrow-rs `interleave` comparison arm and its input builder, keeping the Vortex-only benchmark across the random / round_robin / single_branch access patterns (N=2, N=4, nullable and non-nullable). No dependency changes: every arrow crate in vortex-array is used by the crate's own source. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

codspeed-hq · 2026-06-11T09:14:17Z

Merging this PR will degrade performance by 23.61%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 4 regressed benchmarks
✅ 1524 untouched benchmarks
🆕 6 new benchmarks
⏩ 4 skipped benchmarks¹

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	Simulation	`chunked_bool_canonical_into[(1000, 10)]`	20.2 µs	35.6 µs	-43.06%
❌	Simulation	`chunked_varbinview_canonical_into[(1000, 10)]`	162 µs	198.2 µs	-18.3%
❌	Simulation	`chunked_varbinview_into_canonical[(1000, 10)]`	177.2 µs	213.2 µs	-16.89%
❌	Simulation	`bitwise_not_vortex_buffer_mut[128]`	215.3 ns	244.4 ns	-11.93%
🆕	Simulation	`vortex[random/n2/nonnull]`	N/A	147 µs	N/A
🆕	Simulation	`vortex[random/n2/null]`	N/A	214.7 µs	N/A
🆕	Simulation	`vortex[round_robin/n2/nonnull]`	N/A	152.8 µs	N/A
🆕	Simulation	`vortex[round_robin/n2/null]`	N/A	214.3 µs	N/A
🆕	Simulation	`vortex[random/n64/nonnull]`	N/A	206.1 µs	N/A
🆕	Simulation	`vortex[random/n64/null]`	N/A	376 µs	N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/eager-heisenberg-fxi8wx (b582134) with develop (3d7bbfb)}

4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

…ctor types Drop the `AsPrimitive<usize>` bound on the bool gather: dispatch both selectors over their concrete unsigned width via `match_each_unsigned_integer_ptype!` and convert to an index with a plain `as usize` on the native type, passed in as `branch_at` / `row_at` accessors. Keeps the selector loads at their native width. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Keep only 2-child round-robin, 2-child random, and 64-child random (each nullable and non-nullable); drop the single_branch pattern and the N=4 variants. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…> selectors Go back to the generic `gather<A: AsPrimitive<usize>, R: AsPrimitive<usize>>` with `.as_()` conversions, dropping the per-type `as usize` accessors. The trait conversion avoids the `cast_possible_truncation` lint without a scoped allow and matches the idiom used by the other bool kernels (e.g. take). Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…tern Replace `Combo.pattern: &'static str` and the `match`/`unreachable!` on string literals with a `Pattern` enum (RoundRobin, Random), making the match exhaustive and dropping the unreachable arms. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

…lidity Validity is just another set of boolean buffers routed by the same selectors, so factor the hoisted-pointer unchecked gather into `gather_bits` and call it for both the values and the materialized per-branch validity, instead of duplicating the pointer-table build and `collect_bool` loop. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

claude added 2 commits June 11, 2026 08:05

joseph-isaacs added the changelog/performance A performance improvement label Jun 11, 2026 — with Claude

claude added 5 commits June 11, 2026 09:23

bench(vortex-array): trim interleave benchmark to three focused configs

5c18a56

Keep only 2-child round-robin, 2-child random, and 64-child random (each nullable and non-nullable); drop the single_branch pattern and the N=4 variants. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize interleave boolean gather and add comprehensive benchmarks#8350

Optimize interleave boolean gather and add comprehensive benchmarks#8350
joseph-isaacs wants to merge 8 commits into
developfrom
claude/eager-heisenberg-fxi8wx

joseph-isaacs commented Jun 11, 2026

Uh oh!

codspeed-hq Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joseph-isaacs commented Jun 11, 2026

Summary

Changes

Rationale

Testing

Uh oh!

codspeed-hq Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will degrade performance by 23.61%

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codspeed-hq Bot commented Jun 11, 2026 •

edited

Loading