Skip to content

Skip loading Parquet page index when row-group statistics already prove it cannot prune#22857

Open
RatulDawar wants to merge 4 commits into
apache:mainfrom
RatulDawar:fix/skip-page-index-when-fully-matched
Open

Skip loading Parquet page index when row-group statistics already prove it cannot prune#22857
RatulDawar wants to merge 4 commits into
apache:mainfrom
RatulDawar:fix/skip-page-index-when-fully-matched

Conversation

@RatulDawar

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

The Parquet opener was loading the page index (ColumnIndex + OffsetIndex) before row-group statistics pruning. When all surviving row groups are fully matched by row-group statistics (for example, IS NOT NULL on a non-null column), page index I/O cannot prune further and is wasted.

What changes are included in this PR?

  • Reorder the opener state machine: PrepareFilters → PruneWithStatistics → LoadPageIndex? → LoadBloomFilters
  • Skip load_page_index when there is no page-pruning predicate, no surviving row groups, or every surviving row group is fully matched
  • Add unit and integration tests for the gate and the fully-matched IS NOT NULL case

Are these changes tested?

  • cargo test -p datafusion-datasource-parquet should_load
  • cargo test -p datafusion-datasource-parquet page_index_skip
  • cargo test -p datafusion-datasource-parquet opener::test::test_page_pruning
  • cargo test -p datafusion --test parquet_integration
  • cargo clippy -p datafusion-datasource-parquet --all-targets -- -D warnings

Are there any user-facing changes?

No user-facing API changes. This reduces unnecessary Parquet page index I/O during scan planning when row-group statistics already prove no further pruning is possible.

Made with Cursor

RatulDawar and others added 3 commits June 9, 2026 00:39
…prune.

Reorder the opener so row-group statistics pruning runs before the page
index load, and skip that I/O when every surviving row group is fully matched.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot added the datasource Changes to the datasource crate label Jun 9, 2026
Resolve opener test conflicts after upstream moved opener.rs to opener/mod.rs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@RatulDawar

Copy link
Copy Markdown
Contributor Author

A related question came up while implementing this, can we skip page index I/O per row group (e.g. load index only for RGs that aren't fully matched)?

I checked arrow-rs (parquet 58.3.0 + latest main), but per RG page index apis doesn't seem to be avalible. We can take that implementation as a next setp to this(not sure though per page index skip would be that much beneficial or not).

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@RatulDawar
Thanks for working on this. I think the opener invariant is the right direction, but there
is still one cached-reader path where the skip policy can be bypassed before row-group pruning gets a chance to run. I think we should tighten that up before merging.

// unnecessary I/O. We decide later if it is needed to evaluate the
// pruning predicates. Thus default to not requesting it from the
// underlying reader.
let mut options =

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still needs one more fix in the default cached-reader path.

This change relies on the initial metadata load honoring PageIndexPolicy::Skip, but ArrowReaderMetadata::load_async(...) can still call CachedParquetFileReader::get_metadata(). That path ignores the passed ArrowReaderOptions page-index policy and calls DFParquetMetadata::fetch_metadata() with a metadata cache. From there, the metadata layer forces PageIndexPolicy::Optional whenever a metadata cache exists.

The end result is that the opener can still load ColumnIndex and OffsetIndex during metadata loading, before should_load_page_index() gets a chance to skip it for fully matched row groups.

Could you please make this opener invariant hold end to end by threading or respecting the requested skip policy through the cached reader and metadata cache path? Another workable approach would be to prevent eager page-index fetching until after row-group pruning. It would also be good to add coverage using the default ParquetSource cached-reader path.


let (_, rows) =
count_batches_and_rows(open_file(&morselizer, file).await.unwrap()).await;
assert_eq!(rows, 100);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have: this regression test currently only checks the row count, which would still pass even if the page index were loaded and evaluated.

After the cached-reader path is fixed, could we assert the invariant more directly? For example, the test could use a counting reader or object store that records or fails on page-index range reads, or it could assert a metric or state showing that LoadPageIndex was skipped. That would make the test much better at catching future reorderings that accidentally bring the extra I/O back.

return false;
}

let is_fully_matched = row_groups.is_fully_matched();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small cleanup suggestion: this helper could encode the invariant a bit more directly with is_some_and plus any, which avoids the early return and the double-negative !all(...).

page_pruning_predicate.is_some_and(|_| {
    let fully_matched = row_groups.is_fully_matched();
    row_groups
        .row_group_indexes()
        .any(|idx| !fully_matched[idx])
})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Skip loading the Parquet page index when row-group statistics already prove it cannot prune

2 participants