Skip to content

Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls#3521

Open
tanmayrauth wants to merge 1 commit into
apache:mainfrom
tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune
Open

Fix strict metrics evaluator over-pruning files for NotEqualTo/NotIn with partial nulls#3521
tanmayrauth wants to merge 1 commit into
apache:mainfrom
tanmayrauth:fix-3498-strict-evaluator-not-eq-not-in-null-overprune

Conversation

@tanmayrauth

Copy link
Copy Markdown

_StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on _can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned ROWS_MUST_MATCH without checking the value bounds. A file holding any null or NaN was therefore reported as fully matching the predicate, even when a non-null value inside the bounds did not match.

This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a file with stats [null, 5] and bounds lower=upper=5 would delete the entire file, silently losing the row with value 5 that should have survived.

Every other strict ROWS_MUST_MATCH path already guards on the "only" variants (_contains_nulls_only / _contains_nans_only), matching the reference StrictMetricsEvaluator. Switch both methods to the same guard so that an all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows satisfy not-equal/not-in), while a partially-null column falls through to the bounds check.

Update the existing NotIn-on-some-nulls test that encoded the buggy result and add a regression test covering the [null, value] / bounds-include-literal case for both NotEqualTo and NotIn.

Fixes #3498 (partially)

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

…with partial nulls

_StrictMetricsEvaluator.visit_not_equal and visit_not_in short-circuited on
_can_contain_nulls / _can_contain_nans (null/NaN count > 0) and returned
ROWS_MUST_MATCH without checking the value bounds. A file holding any null or
NaN was therefore reported as fully matching the predicate, even when a
non-null value inside the bounds did not match.

This drives _DeleteFiles (table/update/snapshot.py): ROWS_MUST_MATCH drops the
whole data file without rewriting it. So delete(NotEqualTo("x", 5)) against a
file with stats [null, 5] and bounds lower=upper=5 would delete the entire
file, silently losing the row with value 5 that should have survived.

Every other strict ROWS_MUST_MATCH path already guards on the "only" variants
(_contains_nulls_only / _contains_nans_only), matching the reference
StrictMetricsEvaluator. Switch both methods to the same guard so that an
all-null/all-NaN column still short-circuits to ROWS_MUST_MATCH (those rows
satisfy not-equal/not-in), while a partially-null column falls through to the
bounds check.

Update the existing NotIn-on-some-nulls test that encoded the buggy result and
add a regression test covering the [null, value] / bounds-include-literal case
for both NotEqualTo and NotIn.

Fixes apache#3498 (partially)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Visitor and evaluator edge cases can over-prune files or mishandle nulls

1 participant