Skip to content

GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008

Open
HuaHuaY wants to merge 7 commits into
apache:mainfrom
HuaHuaY:sbbf_filters
Open

GH-50007: [C++][Parquet] Add bloom filter folding to automatically size SBBF filters#50008
HuaHuaY wants to merge 7 commits into
apache:mainfrom
HuaHuaY:sbbf_filters

Conversation

@HuaHuaY

@HuaHuaY HuaHuaY commented May 21, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

This PR follows apache/arrow-rs#9628. It supports optimizing the disk usage of the Bloom filter. So specifying an ndv value larger than the actual value will not affect disk usage.

Bloom filters now support folding mode: allocate a conservatively large filter (sized for worst-case NDV), insert all values during writing, then fold down at flush time to meet a target FPP. This eliminates the need to guess NDV upfront and produces optimally-sized filters automatically.

What changes are included in this PR?

BloomFilterBuilder will try to fold the bloom filter before writing it to the output stream.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

The type of ndv in BloomFilterOptions is changed from int32_t to std::optional<int64_t>. And the argument type of OptimalNumOfBytes and OptimalNumOfBits in BlockSplitBloomFilter is changed from uint32_t ndv to uint64_t ndv.

Add a new field fold in BloomFilterOptions and default value is true.

@HuaHuaY HuaHuaY requested a review from wgtmac as a code owner May 21, 2026 10:19
@github-actions github-actions Bot added the awaiting review Awaiting review label May 21, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #50007 has been automatically assigned in GitHub to PR creator.

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 21, 2026
@HuaHuaY

HuaHuaY commented May 21, 2026

Copy link
Copy Markdown
Contributor Author

@wgtmac @alamb @etseidl @emkornfield Please take a look.

@alamb

alamb commented May 21, 2026

Copy link
Copy Markdown
Contributor

I am not likely to have time to review C++ code in the arrow repository unfortunately

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HuaHuaY for adding this quickly!

Comment thread cpp/src/parquet/properties.h
Comment thread cpp/src/parquet/properties.h Outdated
std::to_string(bloom_filter_options.fpp));
}
if (bloom_filter_options.ndv.has_value() && bloom_filter_options.ndv.value() < 0) {
throw ParquetException("Bloom filter number of distinct values must be >= 0, got " +

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the expected behavior of 0?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will create a smallest bloom filter.

Comment thread cpp/src/parquet/bloom_filter.h
Comment thread cpp/src/parquet/bloom_filter.h Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc
Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc Outdated
@wgtmac

wgtmac commented May 29, 2026

Copy link
Copy Markdown
Member

cc @mapleFU @adamreeve

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM. I left some nits.

Comment thread cpp/src/parquet/properties.h Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/properties.h

@mapleFU mapleFU left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
@HuaHuaY

HuaHuaY commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

@pitrou @mapleFU Please take a look.

Comment thread cpp/src/parquet/bloom_filter.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc
Comment thread cpp/src/parquet/bloom_filter.cc
}
++num_folds;
}
return num_folds;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this algorithm the actual size reduction will always be a power of 2 (group_size = UINT32_C(1) << num_folds). Why aren't we trying to be more granular?

@HuaHuaY HuaHuaY Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlockSplitBloomFilter::Init will check (num_bytes & (num_bytes - 1)) != 0. I didn't find this limitation in the Parquet documentation. But If we break the rule, old parquet reader will not be able to read the bloom filter.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. We probably don't want to produce data that would be incompatible with old readers.

Does the power-of-two constraint serve a purpose? Perhaps we can remove it in a separate PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, can you add a comment somewhere mentioning this restriction?

@HuaHuaY HuaHuaY Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have add a comment in front of group_size.

  // A fold group is a consecutive run of blocks ORed into one output block.
  // Keeping the group size as (1 << num_folds) preserves a power-of-two bitset
  // size. Folding by this power-of-two group size keeps the old-to-new bucket
  // remapping aligned with bucket lookup and avoids false negatives.
  const uint32_t group_size = UINT32_C(1) << num_folds;

After more thinking, I think the actual size reduction must be a power of 2. Because the block index is calculated by static_cast<uint32_t>(((hash >> 32) * NumBlocks()) >> 32);, which is required by parquet's document. And we must ensure that the calculated block index is the same before and after the fold.

Comment thread cpp/src/parquet/bloom_filter_writer.cc Outdated
Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc
Comment thread cpp/src/parquet/bloom_filter.cc Outdated
(static_cast<double>(num_blocks) * kBytesPerFilterBlock * 8);
const auto max_folds = static_cast<uint32_t>(std::countr_zero(num_blocks));

if (avg_fill == 0.0) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I little bit forgot would this really happens when writing a parquet file?

@HuaHuaY HuaHuaY Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If all values in a column chunk are null, avg_fill will be 0.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it still need a BF or fold in this scenerio? Or this path would lead to zero cost folding?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are differences between "not have a bloom filter" and "bloom filter has no values". The latter can filter every not null values. And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about
(1) without folding, just replace to a smallest one without any copying

And there is currently no way to indicate that a bloom filter exists but has no value through metadata.

In theory you're right. In production, I believe null_count == num_values works?

@HuaHuaY HuaHuaY Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without folding, just replace to a smallest one without any copying

Good idea. I will change the code soon.

I believe null_count == num_values works

I don't think it's good to mix two separate components. Also, null_count is an optional value and may not actually exist.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the commits and now only fold when total_set_bits is not equal to 0.

Comment thread cpp/src/parquet/bloom_filter.cc Outdated
const auto* bitset32 = reinterpret_cast<const uint32_t*>(data_->data());
const uint32_t num_words = num_bytes_ / static_cast<uint32_t>(sizeof(uint32_t));
for (uint32_t i = 0; i < num_words; ++i) {
total_set_bits += static_cast<uint64_t>(std::popcount(bitset32[i]));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether internal::CountSetBits easy to understand here ( though popcount is right and a bit faster)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changd to internal::CountSetBits. internal::CountSetBits may be faster because it counts once every 64 bits.

Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc
@HuaHuaY HuaHuaY force-pushed the sbbf_filters branch 2 times, most recently from d26cb2e to d57601b Compare June 4, 2026 09:34
@HuaHuaY

HuaHuaY commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@pitrou Please take a look again.

Comment thread cpp/src/parquet/bloom_filter_reader_writer_test.cc Outdated
Comment thread cpp/src/parquet/bloom_filter.cc
EXPECT_TRUE(filter.FindHash(hash));
}

if (test_case.fold && test_case.inserted_count > 0) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if not folding or not inserting, we can still test that the sample fpp is below kFpp, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can indeed test that sample fpp is less than kFpp. There's only one scenario where sample fpp will be greater than kFpp: manually specifying an excessively small ndv and inserting too much data. However, there's no such test available here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants