[lake/lance] Add byte-based flush threshold to LanceLakeWriter by XuQianJin-Stars · Pull Request #3572 · apache/fluss

XuQianJin-Stars · 2026-07-04T05:16:03Z

Introduce a new max_bytes_per_batch Lance table property (default 0, meaning disabled) that lets the tiering writer flush a batch as soon as its underlying Arrow off-heap allocation reaches the configured number of bytes, in addition to the existing batch_size row-count threshold.

Motivation:

With very wide rows (long strings, large binary/vector columns) the row-count threshold under-flushes, driving peak allocator memory too high.
With very narrow rows the row-count threshold over-flushes and produces many tiny fragments, hurting Lance read performance.

Behavior:

batch_size semantics are preserved (defaults to 512 rows).
max_bytes_per_batch defaults to 0, which fully preserves the previous row-count-only behavior — no change for existing tables.
When max_bytes_per_batch > 0, LanceLakeWriter flushes as soon as either the row count reaches batch_size or the accumulated Arrow field-vector buffer size for the current row count reaches max_bytes_per_batch, whichever comes first.
Negative values are rejected at config parse time with IllegalArgumentException.
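
The "whichever comes first" rule above can be sketched as a small predicate. This is a minimal illustration, not the code in this PR: FlushPolicy and its fields are hypothetical names, and an IntToLongFunction stands in for the Arrow buffer-size lookup (getBufferSizeFor(recordsCount)) so the example runs without Arrow on the classpath.

```java
import java.util.function.IntToLongFunction;

/** Hypothetical stand-in for the flush decision; not the actual LanceLakeWriter code. */
final class FlushPolicy {
    private final int batchSize;                   // row-count threshold (batch_size)
    private final long maxBytesPerBatch;           // byte threshold; 0 disables the byte check
    private final IntToLongFunction bufferSizeFor; // assumed: bytes held for the first N rows

    FlushPolicy(int batchSize, long maxBytesPerBatch, IntToLongFunction bufferSizeFor) {
        this.batchSize = batchSize;
        this.maxBytesPerBatch = maxBytesPerBatch;
        this.bufferSizeFor = bufferSizeFor;
    }

    /** Returns true when either threshold is reached for the rows accumulated so far. */
    boolean shouldFlush(int recordsCount) {
        if (recordsCount >= batchSize) {
            return true; // existing row-count behavior
        }
        // The byte check only applies when the property is set to a positive value.
        return maxBytesPerBatch > 0
                && bufferSizeFor.applyAsLong(recordsCount) >= maxBytesPerBatch;
    }
}
```

With maxBytesPerBatch set to 0 the second branch never fires, which matches the stated backward-compatible default.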

Purpose

Give operators a byte-oriented knob to control when LanceLakeWriter flushes an in-flight row batch to Lance, in addition to the existing row-count threshold. A fixed row-count threshold works poorly across mixed workloads (see Motivation). This aligns Lance tiering with how Iceberg/Paimon writers already expose byte-based flush controls, and bounds peak allocator memory during tiering.

Linked issue: close #xxx

Brief change log

LanceConfig: introduce the max_bytes_per_batch option and a getMaxBytesPerBatch(config) helper. Default is 0 (disabled). Negative values are rejected with IllegalArgumentException.
LanceLakeWriter: add a maxBytesPerBatch field and extract a shouldFlush() helper called from write(). shouldFlush() returns true when the row count reaches batch_size, or (when maxBytesPerBatch > 0) when the buffer size reported by VectorSchemaRoot#getBufferSizeFor(recordsCount) reaches maxBytesPerBatch. Uses getBufferSizeFor(recordsCount) rather than getValueCount() so the check is valid before finish() has propagated the row count to the vectors.
Add LanceConfigTest with 8 test methods covering default / override / zero-explicit / negative / invalid / missing-warehouse paths for both batch_size and max_bytes_per_batch.
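
For illustration, the parsing rules described for getMaxBytesPerBatch could look roughly like the following. This is a sketch under assumptions: LanceConfigSketch is a hypothetical class, the properties are modeled as a plain Map<String, String> rather than the real config type, and the handling of non-numeric input (rethrown as IllegalArgumentException) is a guess, since the description only specifies the exception type for negative values.

```java
import java.util.Map;

/** Hypothetical sketch of the option parsing; not the actual LanceConfig code. */
final class LanceConfigSketch {
    static final String MAX_BYTES_PER_BATCH = "max_bytes_per_batch";
    static final long DEFAULT_MAX_BYTES_PER_BATCH = 0L; // 0 means the byte threshold is disabled

    static long getMaxBytesPerBatch(Map<String, String> props) {
        String raw = props.get(MAX_BYTES_PER_BATCH);
        if (raw == null) {
            return DEFAULT_MAX_BYTES_PER_BATCH; // property absent: keep row-count-only behavior
        }
        long value;
        try {
            value = Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            // Assumption: non-numeric input is surfaced as IllegalArgumentException.
            throw new IllegalArgumentException(
                    MAX_BYTES_PER_BATCH + " must be a number, got: " + raw, e);
        }
        if (value < 0) {
            throw new IllegalArgumentException(
                    MAX_BYTES_PER_BATCH + " must be >= 0, got: " + value);
        }
        return value;
    }
}
```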

Tests

New LanceConfigTest (8 tests) covers all parsing branches for the two thresholds.
Existing LanceTieringTest, LanceArrowUtilsTest, and ArrowDataConverterTest continue to pass unchanged.
Full module run: ./mvnw -pl fluss-lake/fluss-lake-lance clean test -Dspotless.check.skip — 27/27 pass.
Checkstyle: ./mvnw -pl fluss-lake/fluss-lake-lance checkstyle:check — 0 violations.

API and Format

No public Java API changes.
New table property max_bytes_per_batch on Lance-tiered tables. Default 0 (disabled) keeps existing tables' behavior identical, so this change is backward compatible.
batch_size semantics are preserved (default 512 rows).
No wire format, on-disk format, or Lance dataset format changes.

Documentation

The new property is documented via inline Javadoc on LanceConfig#getMaxBytesPerBatch, describing its default, its interaction with batch_size, and the sizing rationale (wide-row memory cap vs. narrow-row fragmentation).
No user-facing site docs update is bundled here since the Lance tiering config surface is not yet documented on the Fluss site; a follow-up docs PR can pick this up together with other Lance properties.

Introduce a new 'max_bytes_per_batch' Lance table property (default 0 means disabled) that lets the tiering writer flush a batch as soon as its underlying Arrow off-heap allocation reaches the configured number of bytes, in addition to the existing 'batch_size' row-count threshold.

Motivation:
- With very wide rows the row-count threshold under-flushes, driving peak allocator memory too high.
- With very narrow rows the row-count threshold over-flushes and produces many tiny fragments, hurting Lance read performance.

Behavior:
- 'batch_size' semantics are preserved (defaults to 512 rows).
- 'max_bytes_per_batch' set to 0 keeps the historical behavior.
- A negative value is rejected with an IllegalArgumentException.
- The new threshold is checked using the Arrow field vectors' current buffer size for the accumulated row count, so small numeric batches and wide string/binary batches are both handled correctly.

Tests:
- New LanceConfigTest covering default / override / zero / negative / invalid parsing paths for both thresholds.
- Existing LanceTieringTest and Arrow util tests continue to pass.

XuQianJin-Stars force-pushed the feat/lance-batch-flush-config branch from 65f3373 to 16528c3 Compare July 4, 2026 05:29

XuQianJin-Stars wants to merge 1 commit into apache:main from XuQianJin-Stars:feat/lance-batch-flush-config