
Add buffer-managed-state design doc and swap-extent buffer pool prototype#36973

Draft
DAlperin wants to merge 28 commits into MaterializeInc:main from DAlperin:dov/swap-pool-prototype

Conversation

@DAlperin

Member

Motivation

Working sets that exceed memory currently spill through the two blob backends in mz_ore::pager: kernel swap via MADV_COLD and per-chunk scratch files. Each is accidentally good at half the workload. Swap is lazy and translation-free but kernel-paced (synchronous per-4 KiB faults on worker threads, direct reclaim). Files are controllable but pay per-chunk inode churn and eager, full-cost serialization, with residency decided irrevocably at pageout time.

This PR proposes a successor architecture and includes a working prototype of its first two layers.

Design doc

doc/developer/design/20260610_buffer_managed_state.md: a buffer-managed architecture in the Umbra/LeanStore/vmcache lineage, adapted to properties those systems don't have — state is immutable once sealed, recreatable from persist (no WAL/manifest/fsync anywhere), and its lifecycle is known to the engine rather than guessed by a cache. Three layers: pooled extent store, stable-address buffer pool with write-behind and lifecycle-driven eviction, and paged sealed arrangement batches. Covers the lgalloc → swap → explicit-pager history, integration with differential's pending Chunk abstraction (TimelyDataflow/differential-dataflow#744), performance estimates against the measured baselines, an eager/lazy backing policy keyed to spine level, and an incremental migration story that coexists with node-level swap indefinitely. Because production nodes currently provision the whole disk as swap (no scratch filesystem), deployment starts with a swap-backed extent store: extents are anonymous allocations holding lz4-compressed bytes pushed to the swap device with MADV_PAGEOUT — the strategy benchmarked in CLU-108 / #36948, generalized into the pool's backing layer.

Prototype

  • mz_ore::pool: size-class anonymous VM regions (64 KiB–2 MiB) with stable slot addresses; per-chunk state machine (UnbackedResident / BackedResident / Evicted / Oversize) under a per-chunk mutex with pin guards; swap-backed extents (lz4 at the eviction boundary only — resident bytes are always uncompressed); FIFO + second-chance eviction against a live-tunable resident-bytes budget. The design's two load-bearing properties hold by construction and are test-asserted: chunks freed before eviction never cost a compression or a write (elided_frees), and re-evicting an already-backed chunk does no I/O (evictions_cheap).
  • column_pager integration: a PagedColumn::Pooled path and ColumnPager::pooled, so the columnar merge batcher works unchanged above the seam.
  • PagedRun (mz_timely_util::columnar::paged_run): a standalone Layer 3 prototype — sealed sorted runs as eagerly-evicted pool pages plus resident fence keys, zero-copy seeks via borrowed columnar views over pinned pool memory, prefetching iteration, and a streaming bounded-window merge. Not yet wired to arrangements; it exists to prove the format and the borrow-safety story.
  • Wiring for staging: new dyncfg column_paged_batcher_use_pool (default off) routes the compute batchers through the pool with the existing fraction-derived budget; enable_upsert_paged_spill now follows whichever mechanism (pool or tiered) the last config apply installed, so the storage upsert stash opts in with no storage-side changes. Nine mz_column_pool_* metrics expose the pool's counters, including the elision rate the design's estimates hinge on.
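The two load-bearing properties of the chunk state machine can be illustrated with a minimal sketch. This is not the mz_ore::pool API: the `Chunk` and `Stats` types, the method names, and the identity `compress`/`decompress` stand-ins for lz4 are all assumptions made for illustration, and the Oversize state, pin guards, and per-chunk mutex are left out.

```rust
/// Hypothetical per-chunk states, mirroring the names in the description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ChunkState {
    /// Bytes live in the slot; no extent has been written.
    UnbackedResident,
    /// Bytes live in the slot and a compressed extent already exists.
    BackedResident,
    /// The slot was released; only the extent remains.
    Evicted,
}

/// Counters named after the ones the description mentions.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Stats {
    pub evictions_compress: u64,
    pub evictions_cheap: u64,
    pub elided_frees: u64,
    pub faults: u64,
}

pub struct Chunk {
    pub state: ChunkState,
    resident: Option<Vec<u8>>,
    extent: Option<Vec<u8>>,
}

// Placeholder codec: a real pool would use lz4 here (assumption).
fn compress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}
fn decompress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

impl Chunk {
    pub fn new(bytes: Vec<u8>) -> Self {
        Chunk { state: ChunkState::UnbackedResident, resident: Some(bytes), extent: None }
    }

    /// Evict the chunk: compress only if no extent exists yet.
    pub fn evict(&mut self, stats: &mut Stats) {
        match self.state {
            ChunkState::UnbackedResident => {
                let bytes = self.resident.take().expect("resident chunk has bytes");
                self.extent = Some(compress(&bytes));
                stats.evictions_compress += 1;
            }
            ChunkState::BackedResident => {
                // The extent is already written: releasing the slot does no I/O.
                self.resident = None;
                stats.evictions_cheap += 1;
            }
            ChunkState::Evicted => return,
        }
        self.state = ChunkState::Evicted;
    }

    /// Read the chunk, faulting it in from its extent if needed.
    pub fn fault_in(&mut self, stats: &mut Stats) -> &[u8] {
        if self.state == ChunkState::Evicted {
            let extent = self.extent.as_ref().expect("evicted chunk has an extent");
            self.resident = Some(decompress(extent));
            self.state = ChunkState::BackedResident;
            stats.faults += 1;
        }
        self.resident.as_deref().expect("resident after fault-in")
    }

    /// Free the chunk; a never-evicted chunk cost no compression and no write.
    pub fn free(self, stats: &mut Stats) {
        if self.state == ChunkState::UnbackedResident {
            stats.elided_frees += 1;
        }
    }
}
```

In this sketch a chunk that is evicted, read, and evicted again pays one compression and one cheap eviction, while a chunk freed before its first eviction only bumps `elided_frees`.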

Testing

Unit tests throughout: 24 for the pool (round trips, evict/fault integrity, a slot-poisoning test proving fault-in reads the extent rather than stale MADV_DONTNEED'd memory, budget enforcement, second chance, dead-data elision, stable addresses, multithreaded smoke), batcher-level round trips through the pool with stats assertions, fault-count-exact seek tests and reference-checked bounded-window merges for PagedRun, and a mechanism-switch routing test for the config seam. The new flag is registered with mzcompose's system-parameter list and parallel-workload.
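To make the "fault-count-exact seek" idea concrete, here is a rough sketch of a seek over resident fence keys that touches exactly one page. The `Run` type, its fields, and the `(u64, i64)` record shape are hypothetical and are not the PagedRun format; the sketch also assumes keys are unique, so a key never straddles a page boundary.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for a sealed sorted run.
pub struct Run {
    /// First key of each page; always resident.
    fences: Vec<u64>,
    /// Page contents; fetching one stands in for a pool fault.
    pages: Vec<Vec<(u64, i64)>>,
    /// Number of pages fetched so far.
    pub faults: Cell<usize>,
}

impl Run {
    /// Split an already sorted slice into pages of `page_len` records.
    pub fn seal(sorted: &[(u64, i64)], page_len: usize) -> Self {
        assert!(page_len > 0);
        let pages: Vec<Vec<(u64, i64)>> =
            sorted.chunks(page_len).map(|p| p.to_vec()).collect();
        let fences = pages.iter().map(|p| p[0].0).collect();
        Run { fences, pages, faults: Cell::new(0) }
    }

    /// Look up `key`, faulting at most one page.
    pub fn seek(&self, key: u64) -> Option<i64> {
        // Count the fences <= key; the last of them owns the key's range.
        let idx = self.fences.partition_point(|f| *f <= key);
        if idx == 0 {
            // The key sorts before the first page: no fault needed.
            return None;
        }
        let page = self.fault(idx - 1);
        page.binary_search_by_key(&key, |(k, _)| *k)
            .ok()
            .map(|i| page[i].1)
    }

    fn fault(&self, idx: usize) -> &[(u64, i64)] {
        self.faults.set(self.faults.get() + 1);
        &self.pages[idx]
    }
}
```

A test in this style can assert on `faults` after each seek, which is the property the fault-count-exact tests are described as checking.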

Status

Draft for discussion alongside the design doc (also up separately as the dov/buffer-managed-state-design branch). The prototype intentionally takes the design's simplest sound options — synchronous on-worker I/O, per-chunk mutexes instead of epochs, owned rehydration — all marked as such in the doc's open questions.

Generated with Claude Code

DAlperin added 25 commits June 11, 2026 21:41
Fold the June 2026 staging measurements into the buffer-managed-state
design doc: bounded accumulation at the budget floor, die-young elision
observed via spill cancellations, off-worker eviction as the de facto
executor answer for the swap-backed store, exact-size extents,
size-class coverage, the phase-scoped boundedness finding (seal drain
fixed, arrange_core materialization now an open question), and the
working-set accounting caveats.

Add a forward-looking section mapping the object-store literature
(cloud-native tiering, AnyBlob request economics, far-memory interface
results, log-structured GC) onto the extent seam, including the
persist-convergence question and the EBS-swap intermediate.

Chains shorter than MIN_PAGED_CHAIN_LEN (4 chunks) no longer route their
entries through the pager: the rebalancing cascade consumes short chains
almost immediately, so paging their chunks scheduled work the next merge
cancelled — measured under hydration load as the spill queue pinned at
its cap with cancellation rates of 100-400/s. Singleton pushes and
below-threshold merge outputs stay resident; chunks reach the pager once
they land in a chain long enough to sit out a few rebalance rounds, and
the seal's extract path pages keep/ship buffers as before.

Resident overhead is bounded by the chain-stack shape (the youngest
chain is under half its predecessor, so sub-threshold chains hold fewer
than MIN_PAGED_CHAIN_LEN chunks between them): single-digit MiB per
batcher, paid per worker per consumer. The disabled pager is safe on the
rehydration side because ColumnPager::take is variant-driven: pooled and
paged inputs rehydrate through their own handles regardless of which
pager performs the take.
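
The routing rule and the resulting resident bound can be sketched as follows. Only `MIN_PAGED_CHAIN_LEN` and its value come from the commit message; the `Placement` enum and both function names are hypothetical, and the 2 MiB chunk size used in the usage note is an assumption taken from the largest size class mentioned in the description.

```rust
/// Threshold named in the commit message.
pub const MIN_PAGED_CHAIN_LEN: usize = 4;

/// Hypothetical routing outcome for a chain's chunks.
#[derive(Debug, PartialEq, Eq)]
pub enum Placement {
    /// Chunks stay in memory and never touch the pager.
    Resident,
    /// Chunks are handed to the pager.
    Paged,
}

/// Short chains are consumed by the next rebalance, so paging them is wasted work.
pub fn place_chain(chain_len: usize) -> Placement {
    if chain_len < MIN_PAGED_CHAIN_LEN {
        Placement::Resident
    } else {
        Placement::Paged
    }
}

/// Exclusive upper bound on bytes held by sub-threshold chains in one batcher,
/// using the commit's claim that they hold fewer than MIN_PAGED_CHAIN_LEN
/// chunks between them.
pub fn resident_bound_exclusive(chunk_bytes: usize) -> usize {
    MIN_PAGED_CHAIN_LEN * chunk_bytes
}
```

With 2 MiB chunks the bound is below 8 MiB per batcher, which is consistent with the "single-digit MiB" figure above.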
…ll_worker_count

The prior name was registered in LaunchDarkly with the wrong data type,
and LD does not allow changing a parameter's type after creation; a
fresh name lets the flag be recreated correctly. Default and semantics
unchanged.
@DAlperin DAlperin force-pushed the dov/swap-pool-prototype branch from 8a4675e to e9cf871 Compare June 12, 2026 02:01
DAlperin added 2 commits June 12, 2026 00:22
Slot cycling paid a per-4KiB write fault plus a kernel page-zeroing for
every reuse: release did MADV_DONTNEED, so the next occupant refaulted
~512 pages per 2 MiB chunk and the kernel zeroed each one moments before
the insert overwrote it. Two amortizations:

Regions whose class is at least 2 MiB are now huge-page aligned
(over-map and trim) and advised MADV_HUGEPAGE, so populating a large
slot is one fault instead of one per 4 KiB, and whole-slot DONTNEED
frees whole huge pages without splitting.
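
The "over-map and trim" step is plain address arithmetic, sketched here without any actual mmap or madvise calls. The `Trim` struct and function names are hypothetical; the sketch assumes 2 MiB huge pages and 4 KiB base pages, as in the commit message.

```rust
/// Assumed huge-page size (2 MiB) and base-page size (4 KiB).
pub const HUGE_PAGE: usize = 2 << 20;
pub const BASE_PAGE: usize = 4 << 10;

/// The pieces of an over-sized mapping to keep and to unmap.
#[derive(Debug, PartialEq, Eq)]
pub struct Trim {
    /// First huge-page-aligned address inside the raw mapping.
    pub aligned_start: usize,
    /// Bytes to unmap before `aligned_start`.
    pub head_len: usize,
    /// Bytes to unmap after the aligned region.
    pub tail_len: usize,
}

/// Length to request so an aligned `len`-byte region is guaranteed to fit.
pub fn overmap_len(len: usize) -> usize {
    len + HUGE_PAGE
}

/// Given where the over-sized mapping landed, compute what to trim.
pub fn trim(raw_start: usize, len: usize) -> Trim {
    let aligned_start = (raw_start + HUGE_PAGE - 1) & !(HUGE_PAGE - 1);
    let head_len = aligned_start - raw_start;
    let tail_len = overmap_len(len) - head_len - len;
    Trim { aligned_start, head_len, tail_len }
}

/// Faults needed to populate a slot one base page at a time.
pub fn base_page_faults(slot_len: usize) -> usize {
    slot_len / BASE_PAGE
}
```

For a 2 MiB slot `base_page_faults` gives 512, the per-reuse fault count that huge-page backing is meant to collapse into one.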

The free list splits into warm and cold sides. A freed slot keeps its
physical pages (no madvise) while total warm bytes stay under an eighth
of the budget; warm reuse faults nothing and skips the zeroing, which
is safe under the existing contract that a slot's next occupant fully
overwrites every byte it reads (the poison test covers both lists).
RSS now exceeds resident_bytes by up to the warm cap, an explicit and
bounded relaxation surfaced as mz_column_pool_warm_bytes alongside a
warm_reuses_total counter.
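
A minimal sketch of the warm/cold split, assuming a single size class and modeling slots as plain ids. The `FreeList` type and its methods are hypothetical; a real pool would issue MADV_DONTNEED where the comment indicates, and the exact cap comparison here is one reading of "stay under an eighth of the budget".

```rust
/// Hypothetical free list for one size class.
pub struct FreeList {
    /// Slots that kept their physical pages.
    warm: Vec<usize>,
    /// Slots whose pages were released.
    cold: Vec<usize>,
    slot_bytes: usize,
    warm_cap_bytes: usize,
    /// Mirrors the warm_reuses_total counter.
    pub warm_reuses: u64,
}

impl FreeList {
    pub fn new(slot_bytes: usize, budget_bytes: usize) -> Self {
        FreeList {
            warm: Vec::new(),
            cold: Vec::new(),
            slot_bytes,
            // The commit caps warm bytes at an eighth of the budget.
            warm_cap_bytes: budget_bytes / 8,
            warm_reuses: 0,
        }
    }

    /// Bytes of RSS held by freed-but-warm slots.
    pub fn warm_bytes(&self) -> usize {
        self.warm.len() * self.slot_bytes
    }

    /// Release a slot; returns true if it stayed warm.
    pub fn release(&mut self, slot: usize) -> bool {
        if self.warm_bytes() + self.slot_bytes <= self.warm_cap_bytes {
            self.warm.push(slot);
            true
        } else {
            // A real pool would madvise(MADV_DONTNEED) the slot here.
            self.cold.push(slot);
            false
        }
    }

    /// Prefer a warm slot: reusing it faults nothing and skips zeroing.
    pub fn acquire(&mut self) -> Option<usize> {
        if let Some(slot) = self.warm.pop() {
            self.warm_reuses += 1;
            return Some(slot);
        }
        self.cold.pop()
    }
}
```

The warm list is what lets RSS exceed the resident-bytes figure, which is why `warm_bytes` is exposed separately in this sketch.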
Implements the design's write-behind semantics: with
column_paged_batcher_eager_backing set, idle spill threads pull
candidates from the eviction queue and compress unbacked chunks to
BackedResident — the chunk stays readable in its slot while a
compressed extent accumulates on the swap device — so a later
budget-driven eviction is a pure page release instead of a compression.
This recovers what made the kernel-swap backend fast (laziness, free
re-access before reclaim) while keeping eviction under engine control.

Backing is not an eviction: it proceeds while the chunk is pinned
(reads of the immutable slot coexist with compression), leaves the
second-chance touched bit alone, keeps the chunk in the eviction
queue, and counts on its own eager_backs counter rather than
evictions_compress. Budget-driven spills keep today's semantics,
including the pinned-at-dequeue cancellation. Freed-while-backing
reuses the existing cancellation windows.

Spill threads prioritize queued evictions, back chunks when idle, and
park with a timeout once everything reachable is backed.
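
The interaction between eager backing, the second-chance bit, and spill-thread priorities might look roughly like this. Everything here is a hypothetical model (the `Queue`, `Entry`, `Outcome`, and `Action` names, and chunks reduced to ids and two flags); pinning, cancellation windows, and the park timeout are not modeled.

```rust
use std::collections::VecDeque;

/// What a budget-driven eviction cost.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// The chunk was already backed: a pure page release.
    EvictedCheap,
    /// The chunk had to be compressed at eviction time.
    EvictedCompress,
}

/// What a spill thread should do next.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Evict,
    Back,
    Park,
}

#[derive(Debug, Clone, Copy)]
struct Entry {
    id: u32,
    touched: bool,
    backed: bool,
}

#[derive(Default)]
pub struct Queue {
    fifo: VecDeque<Entry>,
    pub eager_backs: u64,
}

impl Queue {
    pub fn insert(&mut self, id: u32) {
        self.fifo.push_back(Entry { id, touched: false, backed: false });
    }

    /// Record an access; grants the chunk a second chance.
    pub fn touch(&mut self, id: u32) {
        for e in self.fifo.iter_mut().filter(|e| e.id == id) {
            e.touched = true;
        }
    }

    /// Idle work: back the first unbacked chunk. The touched bit and the
    /// queue position are deliberately left alone.
    pub fn back_one(&mut self) -> Option<u32> {
        let e = self.fifo.iter_mut().find(|e| !e.backed)?;
        let id = e.id;
        e.backed = true;
        self.eager_backs += 1;
        Some(id)
    }

    /// Budget-driven eviction: FIFO with second chance.
    pub fn evict_one(&mut self) -> Option<(u32, Outcome)> {
        while let Some(mut e) = self.fifo.pop_front() {
            if e.touched {
                e.touched = false;
                self.fifo.push_back(e);
                continue;
            }
            let outcome = if e.backed { Outcome::EvictedCheap } else { Outcome::EvictedCompress };
            return Some((e.id, outcome));
        }
        None
    }
}

/// Queued evictions first, then backing, then park.
pub fn next_action(pending_evictions: usize, unbacked: usize) -> Action {
    if pending_evictions > 0 {
        Action::Evict
    } else if unbacked > 0 {
        Action::Back
    } else {
        Action::Park
    }
}
```

In this model a touched chunk that was backed while idle still gets its second chance, and when it is finally evicted the eviction is the cheap kind.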
Extents no longer page out at write: SwapExtent::write leaves the
compressed pages resident, and a new pool RSS target bounds the
compressed-but-resident tier. Eviction past the slot budget compresses
into a resident extent (or just releases the slot when eager backing
prepaid the compression); only when total pool RSS — slots + warm slots
+ resident extents — crosses the target do the oldest extents get
MADV_PAGEOUT, a microseconds madvise whose device write is the kernel's
async writeback. The cost of pressure response now falls as data cools,
and nothing is ever both uncompressed and on the device.

Reads of a paged-out extent fault it back, re-count it against the
tier, and re-enqueue it. A zero target collapses the tier and
reproduces the prior page-out-at-write behavior; a target at or below
the budget plus warm cap does the same, so misconfiguration degrades to
the status quo rather than erroring.
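
As an illustration of how the RSS target bounds the compressed-but-resident tier, here is a simplified model. The `ExtentTier` type and its methods are hypothetical, slot and warm-slot bytes are folded into one fixed number, and the MADV_PAGEOUT call is represented only by a counter.

```rust
use std::collections::VecDeque;

/// Hypothetical model of the resident-extent tier.
pub struct ExtentTier {
    /// Resident extents as (id, compressed bytes), oldest first.
    resident: VecDeque<(u32, usize)>,
    /// Bytes held by slots plus warm slots (fixed in this sketch).
    pub slot_bytes: usize,
    /// Pool RSS target; zero collapses the tier.
    pub target_bytes: usize,
    /// Mirrors the extent_pageouts_total counter.
    pub pageouts: u64,
}

impl ExtentTier {
    pub fn new(slot_bytes: usize, target_bytes: usize) -> Self {
        ExtentTier { resident: VecDeque::new(), slot_bytes, target_bytes, pageouts: 0 }
    }

    pub fn extent_resident_bytes(&self) -> usize {
        self.resident.iter().map(|(_, bytes)| *bytes).sum()
    }

    /// Slots + warm slots + resident extents.
    pub fn pool_rss(&self) -> usize {
        self.slot_bytes + self.extent_resident_bytes()
    }

    /// Record a newly written extent, then page out the oldest extents until
    /// pool RSS is back within the target. Returns the ids paged out.
    pub fn write(&mut self, id: u32, bytes: usize) -> Vec<u32> {
        self.resident.push_back((id, bytes));
        let mut paged_out = Vec::new();
        while self.pool_rss() > self.target_bytes {
            match self.resident.pop_front() {
                Some((old, _)) => {
                    // A real pool would madvise(MADV_PAGEOUT) the extent here.
                    self.pageouts += 1;
                    paged_out.push(old);
                }
                None => break,
            }
        }
        paged_out
    }

    /// Reading a paged-out extent faults it back, re-counts it, and re-enqueues it.
    pub fn read_paged_out(&mut self, id: u32, bytes: usize) -> Vec<u32> {
        self.write(id, bytes)
    }
}
```

With a zero target, or any target at or below `slot_bytes`, every written extent is paged out immediately in this model, matching the degraded behavior described above.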

Budgets now derive from physical RAM (mz_ore::memory, host RAM clamped
by the cgroup limit) instead of the announced memory limit, which on
swap-provisioned nodes includes swap: resident bounds must come from
memory that can be resident. The warm pool gains an absolute 1 GiB
clamp so large budgets don't park gigabytes of idle warm slots.
apply_pool_config takes a PoolPagerConfig struct; the tier is gauged as
mz_column_pool_extent_resident_bytes / extent_pageouts_total and
configured by column_paged_batcher_pool_rss_target_fraction. The
default (0.25 of RAM, alongside the 0.05 budget default) leaves ~20% of
RAM for the compressed tier — zswap's default share, roughly RAM-sized
logical coverage at the measured 5.6x ratio.
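
The arithmetic behind these defaults can be worked through in a short sketch. The `PoolBudgets` struct and function names are hypothetical and are not the PoolPagerConfig fields; the compressed-tier figure is computed as target minus budget and ignores warm slots, which is why it lands on roughly rather than exactly 20%.

```rust
pub const GIB: u64 = 1 << 30;

/// Hypothetical container for the derived bounds, all in bytes.
#[derive(Debug)]
pub struct PoolBudgets {
    pub ram: u64,
    pub slot_budget: u64,
    pub rss_target: u64,
    pub warm_cap: u64,
    pub compressed_tier: u64,
}

/// Host RAM clamped by the cgroup limit, when one is set.
pub fn physical_ram(host_ram: u64, cgroup_limit: Option<u64>) -> u64 {
    cgroup_limit.map_or(host_ram, |limit| host_ram.min(limit))
}

/// Derive the pool bounds from physical RAM and the two fractions.
pub fn derive(ram: u64, budget_fraction: f64, rss_target_fraction: f64) -> PoolBudgets {
    let slot_budget = (ram as f64 * budget_fraction) as u64;
    let rss_target = (ram as f64 * rss_target_fraction) as u64;
    // An eighth of the budget, with the absolute 1 GiB clamp.
    let warm_cap = (slot_budget / 8).min(GIB);
    // Room left for compressed-but-resident extents, ignoring warm slots.
    let compressed_tier = rss_target.saturating_sub(slot_budget);
    PoolBudgets { ram, slot_budget, rss_target, warm_cap, compressed_tier }
}

/// Logical bytes the compressed tier can cover, as a multiple of RAM.
pub fn logical_coverage(b: &PoolBudgets, compression_ratio: f64) -> f64 {
    b.compressed_tier as f64 * compression_ratio / b.ram as f64
}
```

At the defaults (0.05 budget, 0.25 target) and the measured 5.6x ratio, `logical_coverage` comes out near 1.12, which is the "roughly RAM-sized" coverage the commit message refers to.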
