Skip to content

Optimize interleave boolean gather and add comprehensive benchmarks#8350

Draft
joseph-isaacs wants to merge 8 commits into
developfrom
claude/eager-heisenberg-fxi8wx
Draft

Optimize interleave boolean gather and add comprehensive benchmarks#8350
joseph-isaacs wants to merge 8 commits into
developfrom
claude/eager-heisenberg-fxi8wx

Conversation

@joseph-isaacs

Copy link
Copy Markdown
Contributor

Summary

This PR optimizes the boolean gather path in InterleaveArray and adds comprehensive benchmarks comparing Vortex's implementation against arrow-rs across multiple access patterns.

Changes

Benchmark Suite (vortex-array/benches/interleave.rs)

  • Restructured benchmarks to test three distinct access patterns: random (fully random gather), round_robin (structured merge pattern), and single_branch (degenerate gather to one branch)
  • Added parametrization across branch counts (N=2, N=4) and nullability (nullable/non-nullable)
  • Introduced Combo struct to organize benchmark configurations and ensure Vortex and arrow-rs test identical data
  • Added arrow benchmark function to measure arrow-rs interleave kernel performance alongside Vortex
  • Unified input generation via gen_inputs() seeded by combo to ensure byte-identical data for fair comparison

Boolean Gather Optimization (vortex-array/src/arrays/interleave/execute/bool.rs)

  • Optimized the random-access gather loop by hoisting raw (byte_ptr, bit_offset) pairs per value buffer
  • Replaced BitBuffer::value() calls with direct get_bit_unchecked() to avoid redundant bounds checks and struct indexing overhead
  • Materialized validity masks into full-length BitBuffer objects per branch to enable uniform unchecked reads instead of per-row Option/Mask dispatch
  • Added detailed comments explaining the optimization rationale and safety invariants
  • Maintained correctness by validating all bounds once upfront before the tight unchecked loops

Test Coverage (vortex-array/src/arrays/interleave/mod.rs)

  • Added interleave_binary_spans_word_boundary() test for two-value gather with out-of-order routing across 64-bit packing word boundaries
  • Added interleave_binary_nullable_spans_word_boundary() test for the same scenario with nullability to exercise validity packing

Rationale

The boolean gather is a critical path for InterleaveArray execution. The optimization targets the per-bit read overhead in random-access patterns where consecutive outputs read unrelated source words—eliminating the wide struct index and redundant bounds assertion in BitBuffer::value(). The benchmark suite enables direct performance comparison with arrow-rs and validates that the optimization maintains correctness across diverse access patterns and nullability configurations.

Testing

  • Existing interleave_bool and interleave_bool_nullable tests continue to pass
  • Added two new unit tests covering word-boundary spanning scenarios with and without nullability
  • Benchmark suite validates correctness by comparing Vortex and arrow-rs outputs across all Combo configurations
  • All changes are covered by existing and new unit tests; no manual testing required

https://claude.ai/code/session_01FmZKDRJtXNGyuSxq5sqFtE

claude added 2 commits June 11, 2026 08:05
…terleave

Extends the interleave benchmark to measure the Vortex boolean execute path
against the arrow-rs `interleave` kernel on identical data, across three access
patterns (random, round_robin, single_branch), for N=2 and N=4 value arrays,
nullable and non-nullable.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…ecked reads

Profiling and disassembly of the gather inner loop showed the per-bit cost was
dominated not by the 64-bit packing (optimal for a random gather) but by the
overhead around each read: a wide `&[BitBuffer]` struct index, the redundant
bounds `assert` inside `BitBuffer::value`, and on the nullable path a per-row
`Option`/`Mask`-variant dispatch.

Pre-validate the per-row bounds once, then hoist a raw `(ptr, bit_offset)` per
value buffer and read each selected bit with `get_bit_unchecked`. Materialize
each branch's validity into one full-length `BitBuffer` so the validity gather
is the same uniform unchecked read. Measured ~1.2x faster non-nullable and
~1.4x nullable across random / round-robin / single-branch patterns.

Also adds two two-value word-boundary regression tests exercising the gather
across more than one 64-bit packing word, in the non-nullable and nullable
cases, validated against the existing scalar reference oracle.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@joseph-isaacs joseph-isaacs added the changelog/performance A performance improvement label Jun 11, 2026 — with Claude
Remove the arrow-rs `interleave` comparison arm and its input builder, keeping
the Vortex-only benchmark across the random / round_robin / single_branch
access patterns (N=2, N=4, nullable and non-nullable). No dependency changes:
every arrow crate in vortex-array is used by the crate's own source.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
@codspeed-hq

codspeed-hq Bot commented Jun 11, 2026

Copy link
Copy Markdown

Merging this PR will degrade performance by 23.61%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 4 regressed benchmarks
✅ 1524 untouched benchmarks
🆕 6 new benchmarks
⏩ 4 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation chunked_bool_canonical_into[(1000, 10)] 20.2 µs 35.6 µs -43.06%
Simulation chunked_varbinview_canonical_into[(1000, 10)] 162 µs 198.2 µs -18.3%
Simulation chunked_varbinview_into_canonical[(1000, 10)] 177.2 µs 213.2 µs -16.89%
Simulation bitwise_not_vortex_buffer_mut[128] 215.3 ns 244.4 ns -11.93%
🆕 Simulation vortex[random/n2/nonnull] N/A 147 µs N/A
🆕 Simulation vortex[random/n2/null] N/A 214.7 µs N/A
🆕 Simulation vortex[round_robin/n2/nonnull] N/A 152.8 µs N/A
🆕 Simulation vortex[round_robin/n2/null] N/A 214.3 µs N/A
🆕 Simulation vortex[random/n64/nonnull] N/A 206.1 µs N/A
🆕 Simulation vortex[random/n64/null] N/A 376 µs N/A

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/eager-heisenberg-fxi8wx (b582134) with develop (3d7bbfb)

Open in CodSpeed

Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

claude added 5 commits June 11, 2026 09:23
…ctor types

Drop the `AsPrimitive<usize>` bound on the bool gather: dispatch both selectors
over their concrete unsigned width via `match_each_unsigned_integer_ptype!` and
convert to an index with a plain `as usize` on the native type, passed in as
`branch_at` / `row_at` accessors. Keeps the selector loads at their native width.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Keep only 2-child round-robin, 2-child random, and 64-child random (each
nullable and non-nullable); drop the single_branch pattern and the N=4 variants.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…> selectors

Go back to the generic `gather<A: AsPrimitive<usize>, R: AsPrimitive<usize>>`
with `.as_()` conversions, dropping the per-type `as usize` accessors. The
trait conversion avoids the `cast_possible_truncation` lint without a scoped
allow and matches the idiom used by the other bool kernels (e.g. take).

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…tern

Replace `Combo.pattern: &'static str` and the `match`/`unreachable!` on string
literals with a `Pattern` enum (RoundRobin, Random), making the match exhaustive
and dropping the unreachable arms.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…lidity

Validity is just another set of boolean buffers routed by the same selectors, so
factor the hoisted-pointer unchecked gather into `gather_bits` and call it for
both the values and the materialized per-branch validity, instead of duplicating
the pointer-table build and `collect_bool` loop.

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/performance A performance improvement

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants