Skip to content

Migrate to new stats falsification rules#8345

Open
gatesn wants to merge 8 commits into
developfrom
ngates/public-stats-rewrite-rules
Open

Migrate to new stats falsification rules#8345
gatesn wants to merge 8 commits into
developfrom
ngates/public-stats-rewrite-rules

Conversation

@gatesn

@gatesn gatesn commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Port file pruning to session stats rewrites

@gatesn gatesn added the changelog/chore A trivial change label Jun 10, 2026
AdamGS added a commit that referenced this pull request Jun 11, 2026
## Summary

Adds docs for `StatsRewriteRule` and its functions.  

Can be considered a follow-up for
#8345, but can be individually
merged.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
@gatesn gatesn force-pushed the ngates/public-stats-rewrite-rules branch from 033c73f to 5ed375f Compare June 11, 2026 15:49
gatesn added 2 commits June 11, 2026 11:51
Port file pruning to session stats rewrites

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn gatesn force-pushed the ngates/public-stats-rewrite-rules branch from 5ed375f to e8dd011 Compare June 11, 2026 15:51
@codspeed-hq

codspeed-hq Bot commented Jun 11, 2026

Copy link
Copy Markdown

Merging this PR will improve performance by 13.18%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 7 improved benchmarks
❌ 2 regressed benchmarks
✅ 1527 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation decompress_rd[f64, (100000, 0.0)] 845.6 µs 1,024.6 µs -17.47%
Simulation decompress_rd[f32, (100000, 0.0)] 499.2 µs 587 µs -14.96%
Simulation chunked_bool_canonical_into[(1000, 10)] 35.5 µs 20.4 µs +73.82%
Simulation chunked_varbinview_canonical_into[(1000, 10)] 198.1 µs 161.6 µs +22.57%
Simulation chunked_varbinview_into_canonical[(1000, 10)] 213.5 µs 177 µs +20.62%
Simulation decompress_rd[f64, (100000, 0.01)] 981.4 µs 846 µs +16%
Simulation decompress_rd[f64, (100000, 0.1)] 981.3 µs 846 µs +16%
Simulation bitwise_not_vortex_buffer_mut[128] 244.4 ns 215.3 ns +13.55%
Simulation bitwise_not_vortex_buffer_mut[1024] 304.7 ns 275.6 ns +10.58%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing ngates/public-stats-rewrite-rules (ffbccee) with develop (8acef3a)

Open in CodSpeed

gatesn added 3 commits June 11, 2026 14:29
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn gatesn marked this pull request as ready for review June 11, 2026 20:10
@gatesn gatesn requested a review from a team June 11, 2026 20:10
@gatesn gatesn enabled auto-merge (squash) June 11, 2026 20:11
gatesn added 2 commits June 11, 2026 16:13
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
@gatesn gatesn changed the title Make stats rewrite rules public Migrate to new stats falsification rules Jun 11, 2026
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Comment thread .github/workflows/ci.yml
with:
sccache: s3
- uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6
- uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

use crate::scalar::Scalar;
use crate::scalar_fn::fns::stat::StatFn;

/// A target that can bind abstract statistics to concrete expressions.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to bind a statistic to an expression? What makes statistics abstract?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean to bind a statistic to an expression? What makes statistics abstract?

Yeah stats stuff I am intermingling with an expressions change, so I think the "stat" expression becomes explicitly an "Expression::Placeholder", meaning it must be replaced prior to execution.

It's the same logic as our StatCatalog. Except the current StatCatalog means you have to re-run falsification over the entire expression any time your stats come from a different place, e.g. FileStats vs ZoneMap vs ArrayStats.

Here you take the falsified expression, then "bind" the stats from wherever you get them from.


fn lower_stats(&self, predicate: Expression) -> VortexResult<Expression> {
let mut binder = FileStatsBinder { reader: self };
let Some(predicate) = bind_stats(predicate, &mut binder)? else {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at bind_stats not sure I understand how the error message here is related

return Some(FieldPath::root());
}

let field_name = expr.as_opt::<GetItem>()?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated to this PR - but this is a really confusing API

binder.bind_aggregate(input, aggregate_fn, &stat_dtype)
}

fn null_expr(dtype: DType) -> Expression {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - just inline? I feel like these short utility functions always make the code much harder to trace

) -> VortexResult<Option<Expression>> {
let options = expr.as_::<StatFn>();
let aggregate_fn = options.aggregate_fn();
let input = expr.child(0);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this always 0?

.into_inner();

#[expect(deprecated)]
let lowered = lowered.simplify_untyped()?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you do lowered.optimize_recursive(&scope)?

@@ -115,14 +119,52 @@ impl FileStatsLayoutReader {
Ok(result.as_bool().value() == Some(true))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - unwrap_or_default()

refs
}

fn collect_referenced_stat_field_names(expr: &Expression, refs: &mut HashSet<FieldName>) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have Node implemented for Expression, you should be able to visit the tree using that instead of hand-rolling the recursion

}
}

fn bool_literal(expr: &Expression) -> Option<Option<bool>> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called once, inline

return Ok(None);
};
let required_stats = filter_required_stats(&lowered, binder.required_stats);
if required_stats.map().is_empty() && !matches!(bool_literal(&lowered), Some(Some(true))) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment explaining this if statement?

available_stats,
required_stats: Relation::new(),
};
let Some(lowered) = bind_stats(predicate, &mut binder)? else {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm just stupid - but I think both "lowering" and "bindings" are worth defining and explaining somewhere.

@AdamGS

AdamGS commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

I think I mostly have a lot of questions 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/chore A trivial change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants