perf(scan): intra-file decode parallelism — sub-split large chunk spans#8400
perf(scan): intra-file decode parallelism — sub-split large chunk spans#8400lukekim wants to merge 3 commits into
Conversation
SplitBy::Layout now sub-divides any span between adjacent chunk boundaries wider than IDEAL_SPLIT_SIZE (100k rows) into evenly sized row-range splits, so files with few large chunks decode across multiple cores. Subdivision only inserts points strictly between existing adjacent boundaries: the half-open ranges consumers derive remain a contiguous, non-overlapping, exact partition of the same rows. The arithmetic saturates at u64::MAX. Tests: unit/property/overflow coverage for the subdivision helper, an end-to-end test that a 250k-row single flat chunk scans correctly across sub-divided splits with bounded split sizes, an rstest-parameterized end-to-end test for fixed-size SplitBy::RowCount scans (previously only covered at the boundary-math level), and an end-to-end test that ~120-byte string columns written with the default strategy keep their natural ~8k-row chunk splits untouched by the cap. Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Introduce a test-local MAX_SPLIT_ROWS mirroring the private IDEAL_SPLIT_SIZE instead of repeating 100_000, and bound the string-chunk test relative to the cap (< MAX_SPLIT_ROWS / 4) rather than pinning the current ~8k repartition default. Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | varbinview_large |
112.9 µs | 131.5 µs | -14.15% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.01)] |
845.9 µs | 981.6 µs | -13.83% |
| ❌ | Simulation | decompress_rd[f64, (100000, 0.1)] |
845.9 µs | 981.6 µs | -13.82% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.0)] |
1,024.6 µs | 845.9 µs | +21.12% |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.0)] |
586.8 µs | 499.3 µs | +17.51% |
| ⚡ | Simulation | encode_varbin[(1000, 2)] |
176.8 µs | 157.8 µs | +12.08% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing spiceai:lukim/scan-split-large-chunks-develop (e8047a4) with develop (d0013ff)
Footnotes
-
10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
Summary
SplitBy::Layoutnow sub-divides any span between adjacent chunk boundaries wider thanIDEAL_SPLIT_SIZE(100k rows) into evenly sized row-range splits, so a file with few large chunks (e.g. a single flat layout, or byte-targeted int columns that coalesce to ~262k rows/chunk) decodes across multiple cores instead of one.Correctness
Subdivision only inserts points strictly between existing adjacent boundaries — it never moves or removes one — so the half-open ranges consumers derive (
tuple_windows) remain a contiguous, non-overlapping, exact partition of the same rows. Spans at or below the cap pass through untouched (fast-path no-op). All boundary consumers (RepeatedScan,VortexFile::splits(), the DataFusion repartitioner) operate on arbitrary ranges; sub-chunk ranges were already exercised bySplitBy::RowCount. The arithmetic saturates atu64::MAX.API/Observable behavior
For files whose merged (projected-column) chunk boundaries leave spans > 100k rows,
VortexFile::splits()(incl. Python bindings) returns more, smaller ranges; scans emit smaller batches; DataFusion gets real repartitioning where a single-chunk file previously collapsed to one partition. Fine-grained files (e.g. ~8k-row string chunks from the default 1 MiB block target) are untouched.Testing
subdivide_large_spans(no-op, large single chunk, mixed gaps, exact-coverage property,u64::MAXboundary).SplitBy::RowCountscans (unaligned 33,333 and exceeds-file 300,000 cases).cargo nextest run -p vortex-layout -p vortex-file— 174 passed on this branch.