[lake/lance] Add byte-based flush threshold to LanceLakeWriter #3572
Open
XuQianJin-Stars wants to merge 1 commit into
Introduce a new 'max_bytes_per_batch' Lance table property (default 0 means disabled) that lets the tiering writer flush a batch as soon as its underlying Arrow off-heap allocation reaches the configured number of bytes, in addition to the existing 'batch_size' row-count threshold.

Motivation:
- With very wide rows the row-count threshold under-flushes, driving peak allocator memory too high.
- With very narrow rows the row-count threshold over-flushes and produces many tiny fragments, hurting Lance read performance.

Behavior:
- 'batch_size' semantics are preserved (defaults to 512 rows).
- 'max_bytes_per_batch' set to 0 keeps the historical behavior.
- A negative value is rejected with an IllegalArgumentException.
- The new threshold is checked using the Arrow field vectors' current buffer size for the accumulated row count, so small numeric batches and wide string/binary batches are both handled correctly.

Tests:
- New LanceConfigTest covering default / override / zero / negative / invalid parsing paths for both thresholds.
- Existing LanceTieringTest and Arrow util tests continue to pass.
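For reviewers, the either-threshold rule in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FlushPolicySketch` and its parameter names are hypothetical, and the buffered byte count is passed in as a plain `long` (an assumption) so the example runs without Arrow on the classpath.

```java
/** Hypothetical stand-in for the flush decision; not the actual LanceLakeWriter code. */
final class FlushPolicySketch {
    private final int batchSize;         // row-count threshold ('batch_size')
    private final long maxBytesPerBatch; // byte threshold; 0 means disabled

    FlushPolicySketch(int batchSize, long maxBytesPerBatch) {
        if (maxBytesPerBatch < 0) {
            // Negative byte thresholds are rejected, as the commit message describes.
            throw new IllegalArgumentException(
                    "max_bytes_per_batch must be >= 0, got " + maxBytesPerBatch);
        }
        this.batchSize = batchSize;
        this.maxBytesPerBatch = maxBytesPerBatch;
    }

    /**
     * Returns true when either threshold is reached, whichever comes first.
     *
     * @param recordsCount rows accumulated in the current batch
     * @param bufferedBytes bytes currently held by the batch's buffers (assumed to be
     *     supplied by the caller; the PR derives it from the Arrow field vectors)
     */
    boolean shouldFlush(int recordsCount, long bufferedBytes) {
        if (recordsCount >= batchSize) {
            return true; // existing row-count behavior
        }
        // The byte check only applies when the threshold is enabled (> 0).
        return maxBytesPerBatch > 0 && bufferedBytes >= maxBytesPerBatch;
    }
}
```

With `maxBytesPerBatch` set to 0 the second branch never fires, which is why the default leaves the historical row-count-only behavior untouched.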
65f3373 to 16528c3
Introduce a new `max_bytes_per_batch` Lance table property (default `0`, meaning disabled) that lets the tiering writer flush a batch as soon as its underlying Arrow off-heap allocation reaches the configured number of bytes, in addition to the existing `batch_size` row-count threshold.

Motivation:
- With very wide rows, the row-count threshold under-flushes, driving peak allocator memory too high.
- With very narrow rows, the row-count threshold over-flushes and produces many tiny fragments, hurting Lance read performance.

Behavior:
- `batch_size` semantics are preserved (defaults to `512` rows).
- `max_bytes_per_batch` defaults to `0`, which fully preserves the previous row-count-only behavior, so nothing changes for existing tables.
- When `max_bytes_per_batch > 0`, `LanceLakeWriter` flushes as soon as either the row count reaches `batch_size` or the accumulated Arrow field-vector buffer size for the current row count reaches `max_bytes_per_batch`, whichever comes first.
- A negative value is rejected with an `IllegalArgumentException`.

Purpose
Give operators a byte-oriented knob to control when `LanceLakeWriter` flushes an in-flight row batch to Lance, in addition to the existing row-count threshold. A fixed row-count threshold works poorly across mixed workloads (see Motivation). This aligns Lance tiering with how Iceberg/Paimon writers already expose byte-based flush controls, and bounds peak allocator memory during tiering.

Linked issue: close #xxx
Brief change log
- `LanceConfig`: introduce the `max_bytes_per_batch` option and a `getMaxBytesPerBatch(config)` helper. Default is `0` (disabled). Negative values are rejected with `IllegalArgumentException`.
- `LanceLakeWriter`: add a `maxBytesPerBatch` field and extract a `shouldFlush()` helper called from `write()`. `shouldFlush()` returns true when the row count reaches `batch_size`, or (when `maxBytesPerBatch > 0`) when the buffer size reported by `VectorSchemaRoot#getBufferSizeFor(recordsCount)` reaches `maxBytesPerBatch`. Uses `getBufferSizeFor(recordsCount)` rather than `getValueCount()` so the check is valid before `finish()` has propagated the row count to the vectors.
- New `LanceConfigTest` with 8 test methods covering default / override / zero-explicit / negative / invalid / missing-warehouse paths for both `batch_size` and `max_bytes_per_batch`.

Tests
- `LanceConfigTest` (8 tests) covers all parsing branches for the two thresholds.
- `LanceTieringTest`, `LanceArrowUtilsTest`, and `ArrowDataConverterTest` continue to pass unchanged.
- `./mvnw -pl fluss-lake/fluss-lake-lance clean test -Dspotless.check.skip`: 27/27 pass.
- `./mvnw -pl fluss-lake/fluss-lake-lance checkstyle:check`: 0 violations.

API and Format
- New table property `max_bytes_per_batch` on Lance-tiered tables. Default `0` (disabled) keeps existing tables' behavior identical, so this change is backward compatible.
- `batch_size` semantics are preserved (default `512` rows).

Documentation
- Documented on `LanceConfig#getMaxBytesPerBatch`, describing its default, its interaction with `batch_size`, and the sizing rationale (wide-row memory cap vs. narrow-row fragmentation).
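
The parsing rules described for `max_bytes_per_batch` (default `0`, explicit override, negative rejected) could look roughly like the sketch below. It is an illustration under assumptions rather than the PR's implementation: `LanceConfigSketch` is a hypothetical name, the properties arrive as a plain `Map<String, String>` instead of the project's config type, the value is assumed to be a `long`, and a non-numeric value is assumed to surface as the `NumberFormatException` thrown by `Long.parseLong`.

```java
import java.util.Map;

/** Hypothetical sketch of the option parsing; not the actual LanceConfig class. */
final class LanceConfigSketch {
    static final String MAX_BYTES_PER_BATCH = "max_bytes_per_batch";
    static final long DEFAULT_MAX_BYTES_PER_BATCH = 0L; // 0 means disabled

    private LanceConfigSketch() {}

    /** Reads the byte threshold from table properties (assumed to be a string map). */
    static long getMaxBytesPerBatch(Map<String, String> properties) {
        String raw = properties.get(MAX_BYTES_PER_BATCH);
        if (raw == null) {
            // Property not set: keep the row-count-only behavior.
            return DEFAULT_MAX_BYTES_PER_BATCH;
        }
        // Non-numeric input throws NumberFormatException (an IllegalArgumentException).
        long value = Long.parseLong(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(
                    MAX_BYTES_PER_BATCH + " must be >= 0, got " + value);
        }
        return value;
    }
}
```

An empty property map yields `0`, so tables created before this option existed would keep flushing on `batch_size` alone.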