Accurately reserve memory in the build side of hash joins #22861

@jordepic

Describe the bug

HashJoinExec's build side reserves get_record_batch_memory_size(&batch) for each collected batch. That function deduplicates shared buffers only within a single batch, so when the build input emits zero-copy slices of one larger batch (as GroupedHashAggregateStream does when emitting its result in batch_size chunks), every slice is charged the full parent allocation. An aggregate output of S bytes emitted as n slices therefore reserves n × S bytes against S bytes of physical memory. Because the build-side collection cannot spill, this aborts queries that would fit in memory with large headroom.

Observed in DataFusion Comet: 26 GB reserved for 136 MB resident (1.63M-row build side, ~200 slices), failing against a 16 GiB pool share. Reporting each slice's sliced size instead would under-count, because a single slice keeps the entire parent buffer alive via Arc. The correct measure for the collection is therefore the total size of the union of unique buffers it retains.
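
To make the three accounting strategies concrete, here is a minimal, std-only sketch. It does not use arrow-rs or DataFusion: `SliceView`, `Batch`, `per_batch_size`, `sliced_size`, `union_size`, and `slice_into_batches` are all hypothetical names invented for illustration, and a plain `Arc<Vec<u8>>` stands in for a shared Arrow buffer. It only models the idea of deduplicating by allocation identity across the whole collection, not the actual fix.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Toy stand-in for a zero-copy slice of a shared buffer (hypothetical type).
/// Holding `parent` keeps the entire allocation alive, however small `len` is.
struct SliceView {
    parent: Arc<Vec<u8>>,
    len: usize, // bytes this slice logically covers
}

/// Toy stand-in for a record batch: just a list of column views.
struct Batch {
    columns: Vec<SliceView>,
}

/// Per-batch accounting: shared allocations are deduplicated,
/// but only among the columns of this one batch.
fn per_batch_size(batch: &Batch) -> usize {
    let mut seen = HashSet::new();
    batch
        .columns
        .iter()
        .filter(|c| seen.insert(Arc::as_ptr(&c.parent)))
        .map(|c| c.parent.len())
        .sum()
}

/// Sliced-size accounting: counts only the bytes each slice covers,
/// ignoring that the whole parent stays resident.
fn sliced_size(batches: &[Batch]) -> usize {
    batches
        .iter()
        .flat_map(|b| b.columns.iter())
        .map(|c| c.len)
        .sum()
}

/// Union accounting: each distinct allocation is counted once
/// across the entire collection of batches.
fn union_size(batches: &[Batch]) -> usize {
    let mut seen = HashSet::new();
    batches
        .iter()
        .flat_map(|b| b.columns.iter())
        .filter(|c| seen.insert(Arc::as_ptr(&c.parent)))
        .map(|c| c.parent.len())
        .sum()
}

/// Split one parent allocation into `n` single-column batches that share it.
fn slice_into_batches(parent: &Arc<Vec<u8>>, n: usize) -> Vec<Batch> {
    let chunk = parent.len() / n;
    (0..n)
        .map(|_| Batch {
            columns: vec![SliceView {
                parent: Arc::clone(parent),
                len: chunk,
            }],
        })
        .collect()
}
```

With a 1,000-byte parent split into 10 slices, summing `per_batch_size` over the batches gives 10,000 bytes (the n × S over-count), while `union_size` gives 1,000. Retaining just the first slice, `sliced_size` reports 100 bytes even though the full 1,000-byte parent is still alive, which is the under-count described above.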

Labels: bug
