Add buffer-managed-state design doc and swap-extent buffer pool prototype #36973
Draft
DAlperin wants to merge 28 commits into
…stash pager at render
Fold the June 2026 staging measurements into the buffer-managed-state design doc: bounded accumulation at the budget floor, die-young elision observed via spill cancellations, off-worker eviction as the de facto executor answer for the swap-backed store, exact-size extents, size-class coverage, the phase-scoped boundedness finding (seal drain fixed, arrange_core materialization now an open question), and the working-set accounting caveats. Add a forward-looking section mapping the object-store literature (cloud-native tiering, AnyBlob request economics, far-memory interface results, log-structured GC) onto the extent seam, including the persist-convergence question and the EBS-swap intermediate.
Chains shorter than MIN_PAGED_CHAIN_LEN (4 chunks) no longer route their entries through the pager: the rebalancing cascade consumes short chains almost immediately, so paging their chunks scheduled work the next merge cancelled — measured under hydration load as the spill queue pinned at its cap with cancellation rates of 100-400/s. Singleton pushes and below-threshold merge outputs stay resident; chunks reach the pager once they land in a chain long enough to sit out a few rebalance rounds, and the seal's extract path pages keep/ship buffers as before. Resident overhead is bounded by the chain-stack shape (the youngest chain is under half its predecessor, so sub-threshold chains hold fewer than MIN_PAGED_CHAIN_LEN chunks between them): single-digit MiB per batcher, paid per worker per consumer. The disabled pager is safe on the rehydration side because ColumnPager::take is variant-driven: pooled and paged inputs rehydrate through their own handles regardless of which pager performs the take.
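The routing rule above can be sketched as follows. This is a minimal illustration, not the PR's code: only MIN_PAGED_CHAIN_LEN and its value of 4 come from the commit message, while Placement, place_chunk, and resident_chunks are hypothetical names introduced here.

```rust
/// Threshold from the commit message: chains shorter than this keep their chunks resident.
const MIN_PAGED_CHAIN_LEN: usize = 4;

/// Where a chunk's bytes go after a push or merge (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Placement {
    /// Stays in memory; never scheduled on the pager.
    Resident,
    /// Routed through the pager and eligible to spill.
    Paged,
}

/// Decide placement from the length, in chunks, of the chain the chunk lands in.
fn place_chunk(chain_len: usize) -> Placement {
    if chain_len < MIN_PAGED_CHAIN_LEN {
        Placement::Resident
    } else {
        Placement::Paged
    }
}

/// Total chunks held resident by sub-threshold chains in a chain stack,
/// where each entry is one chain's length in chunks.
fn resident_chunks(chain_lens: &[usize]) -> usize {
    chain_lens
        .iter()
        .copied()
        .filter(|&len| place_chunk(len) == Placement::Resident)
        .sum()
}
```

For a chain stack of lengths 16, 7, and 2, only the 2-chunk chain stays resident; the other two route their chunks through the pager.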
…ll_worker_count
The prior name was registered in LaunchDarkly with the wrong data type, and LD does not allow changing a parameter's type after creation; a fresh name lets the flag be recreated correctly. Default and semantics unchanged.
Slot cycling paid a per-4 KiB write fault plus a kernel page-zeroing for every reuse: release did MADV_DONTNEED, so the next occupant refaulted ~512 pages per 2 MiB chunk and the kernel zeroed each one moments before the insert overwrote it. Two amortizations:

Regions whose class is at least 2 MiB are now huge-page aligned (over-map and trim) and advised MADV_HUGEPAGE, so populating a large slot is one fault instead of one per 4 KiB, and whole-slot DONTNEED frees whole huge pages without splitting.

The free list splits into warm and cold sides. A freed slot keeps its physical pages (no madvise) while total warm bytes stay under an eighth of the budget; warm reuse faults nothing and skips the zeroing, which is safe under the existing contract that a slot's next occupant fully overwrites every byte it reads (the poison test covers both lists). RSS now exceeds resident_bytes by up to the warm cap, an explicit and bounded relaxation surfaced as mz_column_pool_warm_bytes alongside a warm_reuses_total counter.
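The warm/cold split can be pictured with the sketch below. It is an assumption-laden illustration rather than the pool's implementation: FreeList and its methods are hypothetical, the cap is applied to a single size class for brevity (the commit describes a cap on total warm bytes), and the real madvise call is replaced by a boolean telling the caller whether the pages were kept.

```rust
/// Hypothetical free list for one size class. Warm slots keep their physical
/// pages; cold slots are ones whose pages the caller has released.
struct FreeList {
    slot_bytes: usize,
    warm_cap_bytes: usize,
    warm: Vec<usize>,
    cold: Vec<usize>,
    warm_reuses_total: u64,
}

impl FreeList {
    /// The warm cap is an eighth of the resident budget, as in the commit message.
    fn new(slot_bytes: usize, budget_bytes: usize) -> Self {
        FreeList {
            slot_bytes,
            warm_cap_bytes: budget_bytes / 8,
            warm: Vec::new(),
            cold: Vec::new(),
            warm_reuses_total: 0,
        }
    }

    fn warm_bytes(&self) -> usize {
        self.warm.len() * self.slot_bytes
    }

    /// Return a slot to the free list. `true` means the slot stayed warm (no
    /// madvise); `false` means the caller should release its pages, for
    /// example with MADV_DONTNEED. Treating the cap as inclusive is an assumption.
    fn release(&mut self, slot: usize) -> bool {
        if self.warm_bytes() + self.slot_bytes <= self.warm_cap_bytes {
            self.warm.push(slot);
            true
        } else {
            self.cold.push(slot);
            false
        }
    }

    /// Prefer a warm slot (no fault, no zeroing); fall back to a cold one.
    fn acquire(&mut self) -> Option<usize> {
        if let Some(slot) = self.warm.pop() {
            self.warm_reuses_total += 1;
            return Some(slot);
        }
        self.cold.pop()
    }
}
```

With 2 MiB slots and a 32 MiB budget the warm side holds at most two slots (4 MiB); a third release goes cold and pays the refault on its next use.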
Implements the design's write-behind semantics: with column_paged_batcher_eager_backing set, idle spill threads pull candidates from the eviction queue and compress unbacked chunks to BackedResident (the chunk stays readable in its slot while a compressed extent accumulates on the swap device), so a later budget-driven eviction is a pure page release instead of a compression. This recovers what made the kernel-swap backend fast (laziness, free re-access before reclaim) while keeping eviction under engine control.

Backing is not an eviction: it proceeds while the chunk is pinned (reads of the immutable slot coexist with compression), leaves the second-chance touched bit alone, keeps the chunk in the eviction queue, and counts on its own eager_backs counter rather than evictions_compress. Budget-driven spills keep today's semantics, including the pinned-at-dequeue cancellation. Freed-while-backing reuses the existing cancellation windows.

Spill threads prioritize queued evictions, back chunks when idle, and park with a timeout once everything reachable is backed.
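The difference between backing and evicting can be sketched as a small state machine. The state names and the eager_backs, evictions_compress, and evictions_cheap counters appear in this PR; everything else (the Chunk and Counters structs, the cancelled counter, and the exact point at which each counter increments) is an assumption made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChunkState {
    UnbackedResident,
    BackedResident,
    Evicted,
}

/// Hypothetical per-chunk bookkeeping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Chunk {
    state: ChunkState,
    touched: bool,
    pinned: bool,
}

/// Hypothetical counter set; `cancelled` is not a metric named in the PR.
#[derive(Debug, Default, PartialEq, Eq)]
struct Counters {
    eager_backs: u64,
    evictions_compress: u64,
    evictions_cheap: u64,
    cancelled: u64,
}

/// Idle-time backing: compress to an extent but leave the chunk resident.
/// Proceeds even while pinned and does not change the touched bit.
fn eager_back(chunk: &mut Chunk, counters: &mut Counters) {
    if chunk.state == ChunkState::UnbackedResident {
        chunk.state = ChunkState::BackedResident;
        counters.eager_backs += 1;
    }
}

/// Budget-driven eviction: cancelled if the chunk is pinned at dequeue,
/// a compression if unbacked, a pure page release if already backed.
fn evict(chunk: &mut Chunk, counters: &mut Counters) {
    if chunk.pinned {
        counters.cancelled += 1;
        return;
    }
    match chunk.state {
        ChunkState::UnbackedResident => {
            counters.evictions_compress += 1;
            chunk.state = ChunkState::Evicted;
        }
        ChunkState::BackedResident => {
            counters.evictions_cheap += 1;
            chunk.state = ChunkState::Evicted;
        }
        ChunkState::Evicted => {}
    }
}
```

In this sketch a chunk that was eagerly backed while pinned keeps its touched bit, survives one eviction attempt via cancellation, and is later evicted without a compression.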
Extents no longer page out at write: SwapExtent::write leaves the compressed pages resident, and a new pool RSS target bounds the compressed-but-resident tier. Eviction past the slot budget compresses into a resident extent (or just releases the slot when eager backing prepaid the compression); only when total pool RSS (slots + warm slots + resident extents) crosses the target do the oldest extents get MADV_PAGEOUT, a microsecond-scale madvise whose device write is the kernel's async writeback. The cost of pressure response now falls as data cools, and nothing is ever both uncompressed and on the device. Reads of a paged-out extent fault it back, re-count it against the tier, and re-enqueue it.

A zero target collapses the tier and reproduces the prior page-out-at-write behavior; a target at or below the budget plus warm cap does the same, so misconfiguration degrades to the status quo rather than erroring.

Budgets now derive from physical RAM (mz_ore::memory, host RAM clamped by the cgroup limit) instead of the announced memory limit, which on swap-provisioned nodes includes swap: resident bounds must come from memory that can be resident. The warm pool gains an absolute 1 GiB clamp so large budgets don't park gigabytes of idle warm slots.

apply_pool_config takes a PoolPagerConfig struct; the tier is gauged as mz_column_pool_extent_resident_bytes / extent_pageouts_total and configured by column_paged_batcher_pool_rss_target_fraction. The default (0.25 of RAM, alongside the 0.05 budget default) leaves ~20% of RAM for the compressed tier: zswap's default share, and roughly RAM-sized logical coverage at the measured 5.6x ratio.
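The budget arithmetic can be worked through with a short sketch. PoolBudgets and derive_budgets are hypothetical names, fractions are passed as parts per thousand to keep the arithmetic in integers, and the room left for resident extents (target minus budget minus warm cap, floored at zero) is inferred from the statement that a target at or below the budget plus warm cap collapses the tier.

```rust
const GIB: u64 = 1 << 30;

/// Hypothetical summary of the derived pool limits, all in bytes.
#[derive(Debug, PartialEq, Eq)]
struct PoolBudgets {
    budget_bytes: u64,
    warm_cap_bytes: u64,
    rss_target_bytes: u64,
    extent_tier_bytes: u64,
}

/// Derive limits from physical RAM. The defaults in the commit message
/// correspond to `budget_permille = 50` and `rss_target_permille = 250`.
fn derive_budgets(physical_ram: u64, budget_permille: u64, rss_target_permille: u64) -> PoolBudgets {
    let budget_bytes = physical_ram * budget_permille / 1000;
    // An eighth of the budget, clamped to an absolute 1 GiB.
    let warm_cap_bytes = (budget_bytes / 8).min(GIB);
    let rss_target_bytes = physical_ram * rss_target_permille / 1000;
    // Assumed: a target at or below budget + warm cap leaves no room for
    // resident extents, so the tier collapses to zero.
    let extent_tier_bytes = rss_target_bytes.saturating_sub(budget_bytes + warm_cap_bytes);
    PoolBudgets {
        budget_bytes,
        warm_cap_bytes,
        rss_target_bytes,
        extent_tier_bytes,
    }
}
```

On a node with 1000 GiB of physical RAM the defaults give a 50 GiB slot budget, a 1 GiB warm cap (clamped down from 6.25 GiB), a 250 GiB RSS target, and 199 GiB for resident extents, which matches the ~20% figure above.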
Motivation
Working sets that exceed memory currently spill through mz_ore::pager's two blob backends (kernel swap via MADV_COLD, per-chunk scratch files), each accidentally good at half the workload: swap is lazy and translation-free but kernel-paced (per-4 KiB synchronous faults on worker threads, direct reclaim); files are controllable but pay per-chunk inode churn and eager full-cost serialization, with residency decided irrevocably at pageout time.

This PR proposes a successor architecture and includes a working prototype of its first two layers.
Design doc
doc/developer/design/20260610_buffer_managed_state.md: a buffer-managed architecture in the Umbra/LeanStore/vmcache lineage, adapted to properties those systems don't have: state is immutable once sealed, recreatable from persist (no WAL/manifest/fsync anywhere), and its lifecycle is known to the engine rather than guessed by a cache. Three layers: pooled extent store, stable-address buffer pool with write-behind and lifecycle-driven eviction, and paged sealed arrangement batches. Covers the lgalloc → swap → explicit-pager history, integration with differential's pending Chunk abstraction (TimelyDataflow/differential-dataflow#744), performance estimates against the measured baselines, an eager/lazy backing policy keyed to spine level, and an incremental migration story that coexists with node-level swap indefinitely. Because production nodes currently provision the whole disk as swap (no scratch filesystem), deployment starts with a swap-backed extent store: extents are anonymous allocations holding lz4-compressed bytes pushed to the swap device with MADV_PAGEOUT, the strategy benchmarked in CLU-108 / #36948, generalized into the pool's backing layer.

Prototype
mz_ore::pool: size-class anonymous VM regions (64 KiB–2 MiB) with stable slot addresses; per-chunk state machine (UnbackedResident/BackedResident/Evicted/Oversize) under a per-chunk mutex with pin guards; swap-backed extents (lz4 at the eviction boundary only: resident bytes are always uncompressed); FIFO + second-chance eviction against a live-tunable resident-bytes budget. The design's two load-bearing properties hold by construction and are test-asserted: chunks freed before eviction never cost a compression or a write (elided_frees), and re-evicting an already-backed chunk does no I/O (evictions_cheap).

column_pager integration: a PagedColumn::Pooled path and ColumnPager::pooled, so the columnar merge batcher works unchanged above the seam.

PagedRun (mz_timely_util::columnar::paged_run): a standalone Layer 3 prototype (sealed sorted runs as eagerly-evicted pool pages plus resident fence keys, zero-copy seeks via borrowed columnar views over pinned pool memory, prefetching iteration, and a streaming bounded-window merge). Not yet wired to arrangements; it exists to prove the format and the borrow-safety story.

column_paged_batcher_use_pool (default off) routes the compute batchers through the pool with the existing fraction-derived budget; enable_upsert_paged_spill now follows whichever mechanism (pool or tiered) the last config apply installed, so the storage upsert stash opts in with no storage-side changes. Nine mz_column_pool_* metrics expose the pool's counters, including the elision rate the design's estimates hinge on.

Testing
Unit tests throughout: 24 for the pool (round trips, evict/fault integrity, a slot-poisoning test proving fault-in reads the extent rather than stale MADV_DONTNEED'd memory, budget enforcement, second chance, dead-data elision, stable addresses, multithreaded smoke), batcher-level round trips through the pool with stats assertions, fault-count-exact seek tests and reference-checked bounded-window merges for PagedRun, and a mechanism-switch routing test for the config seam. The new flag is registered with mzcompose's system-parameter list and parallel-workload.

Status
Draft for discussion alongside the design doc (also up separately as the dov/buffer-managed-state-design branch). The prototype intentionally takes the design's simplest sound options (synchronous on-worker I/O, per-chunk mutexes instead of epochs, owned rehydration), all marked as such in the doc's open questions.

Generated with Claude Code