Add buffer-managed-state design doc and swap-extent buffer pool prototype by DAlperin · Pull Request #36973 · MaterializeInc/materialize

DAlperin · 2026-06-10T21:59:16Z

Motivation

Working sets that exceed memory currently spill through the two blob backends of mz_ore::pager: kernel swap via MADV_COLD and per-chunk scratch files. Each is accidentally good at half the workload. Swap is lazy and translation-free but kernel-paced (synchronous per-4 KiB faults on worker threads, direct reclaim). Files are controllable but pay per-chunk inode churn and eager, full-cost serialization, with residency decided irrevocably at pageout time.

This PR proposes a successor architecture and includes a working prototype of its first two layers.

Design doc

doc/developer/design/20260610_buffer_managed_state.md: a buffer-managed architecture in the Umbra/LeanStore/vmcache lineage, adapted to properties those systems don't have: state is immutable once sealed, recreatable from persist (no WAL/manifest/fsync anywhere), and its lifecycle is known to the engine rather than guessed by a cache. Three layers: a pooled extent store, a stable-address buffer pool with write-behind and lifecycle-driven eviction, and paged sealed arrangement batches.

The doc covers the lgalloc → swap → explicit-pager history, integration with differential's pending Chunk abstraction (TimelyDataflow/differential-dataflow#744), performance estimates against the measured baselines, an eager/lazy backing policy keyed to spine level, and an incremental migration story that coexists with node-level swap indefinitely.

Because production nodes currently provision the whole disk as swap (no scratch filesystem), deployment starts with a swap-backed extent store: extents are anonymous allocations holding lz4-compressed bytes pushed to the swap device with MADV_PAGEOUT. This is the strategy benchmarked in CLU-108 / #36948, generalized into the pool's backing layer.

Prototype

mz_ore::pool: size-class anonymous VM regions (64 KiB–2 MiB) with stable slot addresses; per-chunk state machine (UnbackedResident / BackedResident / Evicted / Oversize) under a per-chunk mutex with pin guards; swap-backed extents (lz4 at the eviction boundary only — resident bytes are always uncompressed); FIFO + second-chance eviction against a live-tunable resident-bytes budget. The design's two load-bearing properties hold by construction and are test-asserted: chunks freed before eviction never cost a compression or a write (elided_frees), and re-evicting an already-backed chunk does no I/O (evictions_cheap).
column_pager integration: a PagedColumn::Pooled path and ColumnPager::pooled, so the columnar merge batcher works unchanged above the seam.
PagedRun (mz_timely_util::columnar::paged_run): a standalone Layer 3 prototype — sealed sorted runs as eagerly-evicted pool pages plus resident fence keys, zero-copy seeks via borrowed columnar views over pinned pool memory, prefetching iteration, and a streaming bounded-window merge. Not yet wired to arrangements; it exists to prove the format and the borrow-safety story.
Wiring for staging: new dyncfg column_paged_batcher_use_pool (default off) routes the compute batchers through the pool with the existing fraction-derived budget; enable_upsert_paged_spill now follows whichever mechanism (pool or tiered) the last config apply installed, so the storage upsert stash opts in with no storage-side changes. Nine mz_column_pool_* metrics expose the pool's counters, including the elision rate the design's estimates hinge on.
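The per-chunk state machine and the FIFO + second-chance policy described above can be modelled in a few lines. This is a minimal single-threaded sketch, not the code in mz_ore::pool: the `Pool`, `Chunk`, and `Stats` types and their methods are hypothetical, the per-chunk mutex, pin guards, and the Oversize state are omitted, and compression and I/O are reduced to counters. Only the state names and the counter names elided_frees and evictions_cheap come from the description.

```rust
use std::collections::VecDeque;

/// Residency states named in the description (`Oversize` is left out here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkState {
    UnbackedResident, // bytes live in a slot; no extent exists yet
    BackedResident,   // bytes live in a slot and a compressed extent exists
    Evicted,          // only the compressed extent remains
}

#[derive(Debug)]
struct Chunk {
    state: ChunkState,
    bytes: usize,
    touched: bool, // second-chance bit, set on access
    freed: bool,   // the owner dropped the chunk
}

#[derive(Debug, Default)]
struct Stats {
    evictions_compress: u64, // evictions that had to compress and write
    evictions_cheap: u64,    // evictions of already-backed chunks (no I/O)
    elided_frees: u64,       // chunks freed before any compression or write
}

/// Hypothetical model of the pool; real slots, locks, and extents are omitted.
struct Pool {
    chunks: Vec<Chunk>,
    queue: VecDeque<usize>, // FIFO of chunk ids, oldest first
    resident_bytes: usize,
    budget: usize,
    stats: Stats,
}

impl Pool {
    fn new(budget: usize) -> Self {
        Pool { chunks: Vec::new(), queue: VecDeque::new(), resident_bytes: 0, budget, stats: Stats::default() }
    }

    fn insert(&mut self, bytes: usize) -> usize {
        let id = self.chunks.len();
        self.chunks.push(Chunk { state: ChunkState::UnbackedResident, bytes, touched: false, freed: false });
        self.queue.push_back(id);
        self.resident_bytes += bytes;
        id
    }

    fn touch(&mut self, id: usize) {
        self.chunks[id].touched = true;
    }

    /// Models write-behind: an extent now exists, the chunk stays resident.
    fn back(&mut self, id: usize) {
        let c = &mut self.chunks[id];
        if !c.freed && c.state == ChunkState::UnbackedResident {
            c.state = ChunkState::BackedResident;
        }
    }

    fn free(&mut self, id: usize) {
        let c = &mut self.chunks[id];
        if c.freed {
            return;
        }
        c.freed = true;
        if c.state != ChunkState::Evicted {
            self.resident_bytes -= c.bytes;
        }
        if c.state == ChunkState::UnbackedResident {
            self.stats.elided_frees += 1; // died young: never compressed or written
        }
    }

    /// FIFO with second chance: a touched chunk is requeued once with its bit cleared.
    fn enforce_budget(&mut self) {
        while self.resident_bytes > self.budget {
            let Some(id) = self.queue.pop_front() else { break };
            let c = &mut self.chunks[id];
            if c.freed || c.state == ChunkState::Evicted {
                continue; // stale queue entry
            }
            if c.touched {
                c.touched = false;
                self.queue.push_back(id);
                continue;
            }
            match c.state {
                ChunkState::UnbackedResident => self.stats.evictions_compress += 1,
                _ => self.stats.evictions_cheap += 1,
            }
            c.state = ChunkState::Evicted;
            self.resident_bytes -= c.bytes;
        }
    }
}
```

In this model a chunk freed while still UnbackedResident only bumps elided_frees, and evicting a BackedResident chunk only bumps evictions_cheap, which mirrors the two properties the description says are test-asserted.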

Testing

Unit tests throughout: 24 for the pool (round trips, evict/fault integrity, a slot-poisoning test proving fault-in reads the extent rather than stale MADV_DONTNEED'd memory, budget enforcement, second chance, dead-data elision, stable addresses, multithreaded smoke), batcher-level round trips through the pool with stats assertions, fault-count-exact seek tests and reference-checked bounded-window merges for PagedRun, and a mechanism-switch routing test for the config seam. The new flag is registered with mzcompose's system-parameter list and parallel-workload.

Status

Draft for discussion alongside the design doc (also up separately as the dov/buffer-managed-state-design branch). The prototype intentionally takes the design's simplest sound options — synchronous on-worker I/O, per-chunk mutexes instead of epochs, owned rehydration — all marked as such in the doc's open questions.

Generated with Claude Code

Fold the June 2026 staging measurements into the buffer-managed-state design doc: bounded accumulation at the budget floor, die-young elision observed via spill cancellations, off-worker eviction as the de facto executor answer for the swap-backed store, exact-size extents, size-class coverage, the phase-scoped boundedness finding (seal drain fixed, arrange_core materialization now an open question), and the working-set accounting caveats. Add a forward-looking section mapping the object-store literature (cloud-native tiering, AnyBlob request economics, far-memory interface results, log-structured GC) onto the extent seam, including the persist-convergence question and the EBS-swap intermediate.

Chains shorter than MIN_PAGED_CHAIN_LEN (4 chunks) no longer route their entries through the pager. The rebalancing cascade consumes short chains almost immediately, so paging their chunks only scheduled work that the next merge cancelled; under hydration load this showed up as the spill queue pinned at its cap with cancellation rates of 100-400/s. Singleton pushes and below-threshold merge outputs stay resident; chunks reach the pager once they land in a chain long enough to sit out a few rebalance rounds, and the seal's extract path pages keep/ship buffers as before. Resident overhead is bounded by the chain-stack shape (the youngest chain is under half its predecessor, so sub-threshold chains hold fewer than MIN_PAGED_CHAIN_LEN chunks between them): single-digit MiB per batcher, paid per worker per consumer. The disabled pager is safe on the rehydration side because ColumnPager::take is variant-driven: pooled and paged inputs rehydrate through their own handles regardless of which pager performs the take.
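The routing rule and the overhead bound can be written down as a small sketch. Only the constant MIN_PAGED_CHAIN_LEN and its value come from the commit message; `Placement`, `placement_for`, and `max_resident_overhead` are hypothetical names, and the 2 MiB chunk size used in the usage note is an assumption, not something the message states.

```rust
/// Threshold named in the commit message.
const MIN_PAGED_CHAIN_LEN: usize = 4;

#[derive(Debug, PartialEq, Eq)]
enum Placement {
    Resident, // stays in memory; the pager is bypassed
    Paged,    // handed to the pager
}

/// Hypothetical helper: where a chunk goes, given the length (in chunks)
/// of the chain it lands in.
fn placement_for(chain_len: usize) -> Placement {
    if chain_len >= MIN_PAGED_CHAIN_LEN {
        Placement::Paged
    } else {
        Placement::Resident
    }
}

/// Upper bound on bytes held resident by sub-threshold chains, taking the
/// message's "fewer than MIN_PAGED_CHAIN_LEN chunks between them" at face value.
fn max_resident_overhead(chunk_bytes: usize) -> usize {
    (MIN_PAGED_CHAIN_LEN - 1) * chunk_bytes
}
```

Under the assumed 2 MiB chunk size, `max_resident_overhead(2 << 20)` comes to 6 MiB, which is consistent with the "single-digit MiB per batcher" figure.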

…ll_worker_count

The prior name was registered in LaunchDarkly with the wrong data type, and LD does not allow changing a parameter's type after creation; a fresh name lets the flag be recreated correctly. Default and semantics unchanged.

Slot cycling paid a per-4 KiB write fault plus a kernel page-zeroing for every reuse: release did MADV_DONTNEED, so the next occupant refaulted ~512 pages per 2 MiB chunk and the kernel zeroed each one moments before the insert overwrote it. Two amortizations address this. First, regions whose class is at least 2 MiB are now huge-page aligned (over-map and trim) and advised MADV_HUGEPAGE, so populating a large slot is one fault instead of one per 4 KiB, and whole-slot DONTNEED frees whole huge pages without splitting. Second, the free list splits into warm and cold sides: a freed slot keeps its physical pages (no madvise) while total warm bytes stay under an eighth of the budget. Warm reuse faults nothing and skips the zeroing, which is safe under the existing contract that a slot's next occupant fully overwrites every byte it reads (the poison test covers both lists). RSS can now exceed resident_bytes by up to the warm cap, an explicit and bounded relaxation surfaced as mz_column_pool_warm_bytes alongside a warm_reuses_total counter.
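The warm/cold split can be sketched as a free list with a byte cap. This is an illustrative model only: `FreeList` and its methods are hypothetical, slots are plain ids rather than mapped memory, the madvise call is represented by a comment, and whether a slot that lands exactly on the cap stays warm is an assumption. The one-eighth cap and the warm-reuse counter are taken from the commit message.

```rust
/// Hypothetical model of a per-size-class free list with warm and cold sides.
struct FreeList {
    warm: Vec<usize>, // freed slots that kept their physical pages
    cold: Vec<usize>, // freed slots whose pages were released
    slot_bytes: usize,
    warm_bytes: usize,
    warm_cap: usize, // an eighth of the budget, per the commit message
    warm_reuses: u64,
}

impl FreeList {
    fn new(slot_bytes: usize, budget: usize) -> Self {
        FreeList {
            warm: Vec::new(),
            cold: Vec::new(),
            slot_bytes,
            warm_bytes: 0,
            warm_cap: budget / 8,
            warm_reuses: 0,
        }
    }

    /// Returns true if the slot stayed warm (no madvise issued).
    fn release(&mut self, slot: usize) -> bool {
        if self.warm_bytes + self.slot_bytes <= self.warm_cap {
            self.warm.push(slot);
            self.warm_bytes += self.slot_bytes;
            true
        } else {
            // A real pool would issue madvise(MADV_DONTNEED) on the slot here.
            self.cold.push(slot);
            false
        }
    }

    /// Prefers warm slots: reusing one faults nothing and skips page zeroing.
    fn acquire(&mut self) -> Option<usize> {
        if let Some(slot) = self.warm.pop() {
            self.warm_bytes -= self.slot_bytes;
            self.warm_reuses += 1;
            Some(slot)
        } else {
            self.cold.pop()
        }
    }
}
```

With 2 MiB slots and a 32 MiB budget the cap is 4 MiB, so the first two released slots stay warm and the third goes cold.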

Implements the design's write-behind semantics: with column_paged_batcher_eager_backing set, idle spill threads pull candidates from the eviction queue and compress unbacked chunks to BackedResident — the chunk stays readable in its slot while a compressed extent accumulates on the swap device — so a later budget-driven eviction is a pure page release instead of a compression. This recovers what made the kernel-swap backend fast (laziness, free re-access before reclaim) while keeping eviction under engine control. Backing is not an eviction: it proceeds while the chunk is pinned (reads of the immutable slot coexist with compression), leaves the second-chance touched bit alone, keeps the chunk in the eviction queue, and counts on its own eager_backs counter rather than evictions_compress. Budget-driven spills keep today's semantics, including the pinned-at-dequeue cancellation. Freed-while-backing reuses the existing cancellation windows. Spill threads prioritize queued evictions, back chunks when idle, and park with a timeout once everything reachable is backed.
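The spill-thread priority order and the "backing is not an eviction" rule can be illustrated with a small model. Everything here is a hypothetical sketch: `Spill`, `Work`, `next_work`, and `back` are invented names, threads and parking are reduced to a returned enum value, and compression is not performed. The state names and the eager_backs counter follow the commit message.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    UnbackedResident,
    BackedResident,
}

#[derive(Debug)]
struct Chunk {
    state: State,
    touched: bool, // second-chance bit
    pinned: bool,  // a reader currently holds the slot
}

#[derive(Debug, PartialEq, Eq)]
enum Work {
    Evict(usize), // budget-driven eviction, always first
    Back(usize),  // idle-time compression to BackedResident
    Park,         // nothing reachable is left to back
}

/// Hypothetical model of what one spill thread looks at.
struct Spill {
    pending_evictions: VecDeque<usize>,
    chunks: Vec<Chunk>,
    eager_backs: u64,
}

impl Spill {
    /// Queued evictions take priority; backing only happens when idle.
    fn next_work(&mut self) -> Work {
        if let Some(id) = self.pending_evictions.pop_front() {
            return Work::Evict(id);
        }
        match self.chunks.iter().position(|c| c.state == State::UnbackedResident) {
            Some(id) => Work::Back(id),
            None => Work::Park,
        }
    }

    /// Backing changes only the state and its own counter. It proceeds even
    /// if the chunk is pinned and leaves the second-chance bit alone.
    fn back(&mut self, id: usize) {
        let c = &mut self.chunks[id];
        if c.state == State::UnbackedResident {
            c.state = State::BackedResident;
            self.eager_backs += 1;
        }
    }
}
```

A later budget-driven eviction of a chunk that `back` already moved to BackedResident would then be a pure page release, which is the point of the write-behind design.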

Extents no longer page out at write: SwapExtent::write leaves the compressed pages resident, and a new pool RSS target bounds the compressed-but-resident tier. Eviction past the slot budget compresses into a resident extent (or just releases the slot when eager backing prepaid the compression). Only when total pool RSS (slots + warm slots + resident extents) crosses the target do the oldest extents get MADV_PAGEOUT, a microsecond-scale madvise whose device write is left to the kernel's asynchronous writeback. The cost of pressure response now falls as data cools, and nothing is ever both uncompressed and on the device. Reads of a paged-out extent fault it back, re-count it against the tier, and re-enqueue it. A zero target collapses the tier and reproduces the prior page-out-at-write behavior; a target at or below the budget plus warm cap does the same, so misconfiguration degrades to the status quo rather than erroring.

Budgets now derive from physical RAM (mz_ore::memory, host RAM clamped by the cgroup limit) instead of the announced memory limit, which on swap-provisioned nodes includes swap: resident bounds must come from memory that can be resident. The warm pool gains an absolute 1 GiB clamp so large budgets don't park gigabytes of idle warm slots. apply_pool_config takes a PoolPagerConfig struct; the tier is gauged as mz_column_pool_extent_resident_bytes / extent_pageouts_total and configured by column_paged_batcher_pool_rss_target_fraction. The default (0.25 of RAM, alongside the 0.05 budget default) leaves ~20% of RAM for the compressed tier: zswap's default share, and roughly RAM-sized logical coverage at the measured 5.6x ratio.
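The tier accounting and the default-share arithmetic can be checked with a short sketch. The `CompressedTier` type, its methods, and `compressed_tier_fraction` are hypothetical; extents are modelled as (id, compressed bytes) pairs and MADV_PAGEOUT is represented by returning the ids that would be paged out. The fractions 0.25 and 0.05 and the 5.6x ratio are the figures quoted in the commit message, and the share calculation ignores the warm cap, as the message's ~20% figure appears to.

```rust
use std::collections::VecDeque;

/// Hypothetical model of the compressed-but-resident tier.
struct CompressedTier {
    slot_bytes: usize,                        // uncompressed resident slots
    warm_bytes: usize,                        // warm free-list slots
    resident_extents: VecDeque<(u64, usize)>, // (extent id, compressed bytes), oldest first
    rss_target: usize,
    extent_pageouts: u64,
}

impl CompressedTier {
    fn extent_resident_bytes(&self) -> usize {
        self.resident_extents.iter().map(|(_, bytes)| *bytes).sum()
    }

    /// Total pool RSS: slots + warm slots + resident extents.
    fn pool_rss(&self) -> usize {
        self.slot_bytes + self.warm_bytes + self.extent_resident_bytes()
    }

    /// Writing an extent leaves it resident; returns ids paged out as a result.
    fn write_extent(&mut self, id: u64, compressed_bytes: usize) -> Vec<u64> {
        self.resident_extents.push_back((id, compressed_bytes));
        self.enforce_target()
    }

    /// Pages out the oldest extents until pool RSS is back under the target.
    fn enforce_target(&mut self) -> Vec<u64> {
        let mut paged_out = Vec::new();
        while self.pool_rss() > self.rss_target {
            let Some((id, _bytes)) = self.resident_extents.pop_front() else { break };
            // A real pool would call madvise(MADV_PAGEOUT) on the extent here.
            self.extent_pageouts += 1;
            paged_out.push(id);
        }
        paged_out
    }
}

/// Share of RAM left for compressed extents once the slot budget is full.
fn compressed_tier_fraction(rss_target_fraction: f64, budget_fraction: f64) -> f64 {
    (rss_target_fraction - budget_fraction).max(0.0)
}
```

With the defaults, `compressed_tier_fraction(0.25, 0.05)` is about 0.20 of RAM, and 0.20 × 5.6 ≈ 1.12, which is the "roughly RAM-sized logical coverage" in the message. A zero target makes every `write_extent` return its own id immediately, matching the page-out-at-write fallback.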

DAlperin force-pushed the dov/swap-pool-prototype branch from 08218b0 to 684d6ba on June 10, 2026 22:54

DAlperin mentioned this pull request Jun 11, 2026

design: buffer-managed dataflow state #36970

Closed

DAlperin added 25 commits June 11, 2026 21:41

design: buffer-managed dataflow state

40e2184

design: integrate buffer-managed state with differential's Chunk abstraction

22ccdb4

design: incremental migration and swap coexistence

d8b7b23

design: swap-backed extent store for filesystem-less nodes

25a722a

prototype: swap-extent buffer pool, layers 2 and 3

d66f677

column-pager: wire buffer pool behind dyncfgs for staging

59da42d

column-pager: log spill mechanism resolution for debugging

db9e26f

column-pager: keep tiered singleton configured in pool mode; resolve stash pager at render

a1c30b7

design: batched-seek cursors as the probe prefetch story

0e1074e

pool: off-worker spill threads for eviction I/O

df13c7a

compute-types: default pool spill threads to 2

41d3da2

pool: insert_with writes serialized columns directly into the slot

1a5c0b5

pool: 64 GiB class reservations; degrade to heap on slot exhaustion

c920501

pool: scope slot lifetime to residency, dropping the stable-address invariant

123ed18

pool: single-flight budget enforcement over a resident-only queue

095852c

pool: size extents to the compressed payload, not lz4 worst case

48803bc

pool: add 4 and 8 MiB size classes for boundary-overshooting chunks

a0e6489

storage: drain upsert stash seal chunk-at-a-time

777f725

pool: cleanup pass from complexity review

568a0dd

storage: tidy upsert drain seal path

4ac9f0f

timely-util: collapse column pager config globals

a390c75

pool: rename elided frees to writes elided

5729723

DAlperin force-pushed the dov/swap-pool-prototype branch from 8a4675e to e9cf871 on June 12, 2026 02:01

DAlperin added 2 commits June 12, 2026 00:22

DAlperin wants to merge 28 commits into MaterializeInc:main from DAlperin:dov/swap-pool-prototype
