Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls by tanmayrauth · Pull Request #3521 · apache/iceberg-python

tanmayrauth · 2026-06-17T07:03:08Z

_StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on _can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned ROWS_MUST_MATCH without checking the value bounds. A file holding any null or NaN was therefore reported as fully matching the predicate, even when a non-null value inside the bounds did not match.

This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a file with stats [null, 5] and bounds lower=upper=5 would delete the entire file, silently losing the row with value 5 that should have survived.

Every other strict ROWS_MUST_MATCH path already guards on the "only" variants (_contains_nulls_only / _contains_nans_only), matching the reference StrictMetricsEvaluator. Switch both methods to the same guard so that an all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows satisfy not-equal/not-in), while a partially-null column falls through to the bounds check.

Update the existing NotIn-on-some-nulls test that encoded the buggy result and add a regression test covering the [null, value] / bounds-include-literal case for both NotEqualTo and NotIn.

Fixes #3498 (partially)

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

…with partial nulls _StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on _can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned ROWS_MUST_MATCH without checking the value bounds. A file holding any null or NaN was therefore reported as fully matching the predicate, even when a non-null value inside the bounds did not match. This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a file with stats [null, 5] and bounds lower=upper=5 would delete the entire file, silently losing the row with value 5 that should have survived. Every other strict ROWS_MUST_MATCH path already guards on the "only" variants (_contains_nulls_only / _contains_nans_only), matching the reference StrictMetricsEvaluator. Switch both methods to the same guard so that an all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows satisfy not-equal/not-in), while a partially-null column falls through to the bounds check. Update the existing NotIn-on-some-nulls test that encoded the buggy result and add a regression test covering the [null, value] / bounds-include-literal case for both NotEqualTo and NotIn. Fixes apache#3498 (partially)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls#3521

Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls#3521
tanmayrauth wants to merge 1 commit into
apache:mainfrom
tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune

tanmayrauth commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tanmayrauth commented Jun 17, 2026

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant