Describe the bug
HashJoinExec's build side reserves get_record_batch_memory_size(&batch) per collected batch. That function deduplicates shared buffers only within one batch, so when the build input emits zero-copy slices of one larger batch — as GroupedHashAggregateStream does when emitting its result in batch_size chunks — every slice is charged the full parent allocation. An aggregate output of S bytes in n slices reserves n × S for S bytes of physical memory; since the build collection cannot spill, this aborts queries that fit in memory with large headroom.
Observed in DataFusion Comet: 26 GB reserved for 136 MB resident (1.63M-row build side, ~200 slices), failing against a 16 GiB pool share. Reporting each slice's sliced size instead would under-count, because a single slice keeps the entire parent buffer alive via Arc. The correct measure for the collection is therefore the union of unique buffers it retains.
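The arithmetic can be illustrated without Arrow at all. The sketch below is a std-only model, not DataFusion code: `Slice`, `make_slices`, `per_slice_reservation`, `sliced_bytes`, and `unique_buffer_reservation` are hypothetical names, and an `Arc<Vec<u8>>` stands in for a shared Arrow buffer. It only shows how the three accounting strategies diverge when n slices share one parent allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for a zero-copy slice of a shared buffer.
struct Slice {
    parent: Arc<Vec<u8>>, // the allocation this slice keeps alive
    len: usize,           // bytes logically covered by the slice
}

/// Cut one parent allocation of `parent_bytes` into `n` zero-copy slices (n > 0).
fn make_slices(parent_bytes: usize, n: usize) -> Vec<Slice> {
    let parent = Arc::new(vec![0u8; parent_bytes]);
    let chunk = (parent_bytes + n - 1) / n;
    (0..parent_bytes)
        .step_by(chunk)
        .map(|offset| Slice {
            parent: Arc::clone(&parent),
            len: chunk.min(parent_bytes - offset),
        })
        .collect()
}

/// Per-slice accounting: every slice is charged its full parent (n x S).
fn per_slice_reservation(slices: &[Slice]) -> usize {
    slices.iter().map(|s| s.parent.len()).sum()
}

/// Sliced-size accounting: under-counts whenever a slice pins a larger parent.
fn sliced_bytes(slices: &[Slice]) -> usize {
    slices.iter().map(|s| s.len).sum()
}

/// Union of unique buffers: each parent allocation is charged once,
/// identified here by the address behind the Arc.
fn unique_buffer_reservation(slices: &[Slice]) -> usize {
    let mut seen: HashSet<*const Vec<u8>> = HashSet::new();
    slices
        .iter()
        .filter(|s| seen.insert(Arc::as_ptr(&s.parent)))
        .map(|s| s.parent.len())
        .sum()
}

fn main() {
    let slices = make_slices(1000, 10);
    println!("per-slice: {}", per_slice_reservation(&slices));
    println!("unique:    {}", unique_buffer_reservation(&slices));
    // Retaining a single slice still pins the whole parent.
    println!("one slice, sliced size: {}", sliced_bytes(&slices[..1]));
    println!("one slice, unique:      {}", unique_buffer_reservation(&slices[..1]));
}
```

With a 1000-byte parent in 10 slices, per-slice accounting charges 10x the physical memory, while the sliced size of a single retained slice reports only a tenth of what that slice actually pins. Only the unique-buffer union matches the memory held in both cases.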
To Reproduce
No response
Expected behavior
No response
Additional context
No response