Skip to content

perf(scan): intra-file decode parallelism — sub-split large chunk spans#8400

Open
lukekim wants to merge 3 commits into
vortex-data:developfrom
spiceai:lukim/scan-split-large-chunks-develop
Open

perf(scan): intra-file decode parallelism — sub-split large chunk spans#8400
lukekim wants to merge 3 commits into
vortex-data:developfrom
spiceai:lukim/scan-split-large-chunks-develop

Conversation

@lukekim

@lukekim lukekim commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Summary

SplitBy::Layout now sub-divides any span between adjacent chunk boundaries wider than IDEAL_SPLIT_SIZE (100k rows) into evenly sized row-range splits, so a file with few large chunks (e.g. a single flat layout, or byte-targeted int columns that coalesce to ~262k rows/chunk) decodes across multiple cores instead of one.

Correctness

Subdivision only inserts points strictly between existing adjacent boundaries — it never moves or removes one — so the half-open ranges consumers derive (tuple_windows) remain a contiguous, non-overlapping, exact partition of the same rows. Spans at or below the cap pass through untouched (fast-path no-op). All boundary consumers (RepeatedScan, VortexFile::splits(), the DataFusion repartitioner) operate on arbitrary ranges; sub-chunk ranges were already exercised by SplitBy::RowCount. The arithmetic saturates at u64::MAX.

API/Observable behavior

For files whose merged (projected-column) chunk boundaries leave spans > 100k rows, VortexFile::splits() (incl. Python bindings) returns more, smaller ranges; scans emit smaller batches; DataFusion gets real repartitioning where a single-chunk file previously collapsed to one partition. Fine-grained files (e.g. ~8k-row string chunks from the default 1 MiB block target) are untouched.

Testing

  • Unit/property/overflow tests for subdivide_large_spans (no-op, large single chunk, mixed gaps, exact-coverage property, u64::MAX boundary).
  • E2E: 250k-row single flat chunk → splits all ≤ the cap, contiguous, exact endpoints; full + filtered scans match the unsplit data.
  • E2E (rstest): fixed-size SplitBy::RowCount scans (unaligned 33,333 and exceeds-file 300,000 cases).
  • E2E: ~120-byte string column via the default write strategy keeps its natural fine-grained chunk splits (bounded relative to the cap, not to writer defaults).
  • cargo nextest run -p vortex-layout -p vortex-file — 174 passed on this branch.

lukekim added 3 commits June 12, 2026 14:01
SplitBy::Layout now sub-divides any span between adjacent chunk boundaries
wider than IDEAL_SPLIT_SIZE (100k rows) into evenly sized row-range splits,
so files with few large chunks decode across multiple cores. Subdivision
only inserts points strictly between existing adjacent boundaries: the
half-open ranges consumers derive remain a contiguous, non-overlapping,
exact partition of the same rows. The arithmetic saturates at u64::MAX.

Tests: unit/property/overflow coverage for the subdivision helper, an
end-to-end test that a 250k-row single flat chunk scans correctly across
sub-divided splits with bounded split sizes, an rstest-parameterized
end-to-end test for fixed-size SplitBy::RowCount scans (previously only
covered at the boundary-math level), and an end-to-end test that ~120-byte
string columns written with the default strategy keep their natural ~8k-row
chunk splits untouched by the cap.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Introduce a test-local MAX_SPLIT_ROWS mirroring the private IDEAL_SPLIT_SIZE
instead of repeating 100_000, and bound the string-chunk test relative to the
cap (< MAX_SPLIT_ROWS / 4) rather than pinning the current ~8k repartition
default.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
@lukekim lukekim requested a review from a team June 12, 2026 21:20
@gatesn gatesn added the action/benchmark Trigger full benchmarks to run on this PR label Jun 12, 2026
@codspeed-hq

codspeed-hq Bot commented Jun 12, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
❌ 3 regressed benchmarks
✅ 1531 untouched benchmarks
⏩ 10 skipped benchmarks1

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation varbinview_large 112.9 µs 131.5 µs -14.15%
Simulation decompress_rd[f64, (100000, 0.01)] 845.9 µs 981.6 µs -13.83%
Simulation decompress_rd[f64, (100000, 0.1)] 845.9 µs 981.6 µs -13.82%
Simulation decompress_rd[f64, (100000, 0.0)] 1,024.6 µs 845.9 µs +21.12%
Simulation decompress_rd[f32, (100000, 0.0)] 586.8 µs 499.3 µs +17.51%
Simulation encode_varbin[(1000, 2)] 176.8 µs 157.8 µs +12.08%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing spiceai:lukim/scan-split-large-chunks-develop (e8047a4) with develop (d0013ff)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark Trigger full benchmarks to run on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants