Migrate to new stats falsification rules#8345
Conversation
## Summary Adds docs for `StatsRewriteRule` and its functions. Can be considered a follow-up for #8345, but can be individually merged. Signed-off-by: Adam Gutglick <adam@spiraldb.com>
033c73f to
5ed375f
Compare
Port file pruning to session stats rewrites Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
5ed375f to
e8dd011
Compare
Merging this PR will improve performance by 13.18%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | decompress_rd[f64, (100000, 0.0)] |
845.6 µs | 1,024.6 µs | -17.47% |
| ❌ | Simulation | decompress_rd[f32, (100000, 0.0)] |
499.2 µs | 587 µs | -14.96% |
| ⚡ | Simulation | chunked_bool_canonical_into[(1000, 10)] |
35.5 µs | 20.4 µs | +73.82% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(1000, 10)] |
198.1 µs | 161.6 µs | +22.57% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(1000, 10)] |
213.5 µs | 177 µs | +20.62% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.01)] |
981.4 µs | 846 µs | +16% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.1)] |
981.3 µs | 846 µs | +16% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] |
244.4 ns | 215.3 ns | +13.55% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[1024] |
304.7 ns | 275.6 ns | +10.58% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ngates/public-stats-rewrite-rules (ffbccee) with develop (8acef3a)
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: "Nicholas Gates" <nick@nickgates.com> Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
| with: | ||
| sccache: s3 | ||
| - uses: actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 # v6 | ||
| - uses: actions/setup-java@be666c2fcd27ec809703dec50e508c2fdc7f6654 # v5 |
| use crate::scalar::Scalar; | ||
| use crate::scalar_fn::fns::stat::StatFn; | ||
|
|
||
| /// A target that can bind abstract statistics to concrete expressions. |
There was a problem hiding this comment.
What does it mean to bind a statistic to an expression? What makes statistics abstract?
There was a problem hiding this comment.
What does it mean to bind a statistic to an expression? What makes statistics abstract?
Yeah stats stuff I am intermingling with an expressions change, so I think the "stat" expression becomes explicitly an "Expression::Placeholder", meaning it must be replaced prior to execution.
It's the same logic as our StatCatalog. Except the current StatCatalog means you have to re-run falsification over the entire expression any time your stats come from a different place, e.g. FileStats vs ZoneMap vs ArrayStats.
Here you take the falsified expression, then "bind" the stats from wherever you get them from.
|
|
||
| fn lower_stats(&self, predicate: Expression) -> VortexResult<Expression> { | ||
| let mut binder = FileStatsBinder { reader: self }; | ||
| let Some(predicate) = bind_stats(predicate, &mut binder)? else { |
There was a problem hiding this comment.
Looking at bind_stats not sure I understand how the error message here is related
| return Some(FieldPath::root()); | ||
| } | ||
|
|
||
| let field_name = expr.as_opt::<GetItem>()?; |
There was a problem hiding this comment.
unrelated to this PR - but this is a really confusing API
| binder.bind_aggregate(input, aggregate_fn, &stat_dtype) | ||
| } | ||
|
|
||
| fn null_expr(dtype: DType) -> Expression { |
There was a problem hiding this comment.
nit - just inline? I feel like these short utility functions always make the code much harder to trace
| ) -> VortexResult<Option<Expression>> { | ||
| let options = expr.as_::<StatFn>(); | ||
| let aggregate_fn = options.aggregate_fn(); | ||
| let input = expr.child(0); |
| .into_inner(); | ||
|
|
||
| #[expect(deprecated)] | ||
| let lowered = lowered.simplify_untyped()?; |
There was a problem hiding this comment.
can't you do lowered.optimize_recursive(&scope)?
| @@ -115,14 +119,52 @@ impl FileStatsLayoutReader { | |||
| Ok(result.as_bool().value() == Some(true)) | |||
| refs | ||
| } | ||
|
|
||
| fn collect_referenced_stat_field_names(expr: &Expression, refs: &mut HashSet<FieldName>) { |
There was a problem hiding this comment.
we have Node implemented for Expression, you should be able to visit the tree using that instead of hand-rolling the recursion
| } | ||
| } | ||
|
|
||
| fn bool_literal(expr: &Expression) -> Option<Option<bool>> { |
| return Ok(None); | ||
| }; | ||
| let required_stats = filter_required_stats(&lowered, binder.required_stats); | ||
| if required_stats.map().is_empty() && !matches!(bool_literal(&lowered), Some(Some(true))) { |
There was a problem hiding this comment.
maybe add a comment explaining this if statement?
| available_stats, | ||
| required_stats: Relation::new(), | ||
| }; | ||
| let Some(lowered) = bind_stats(predicate, &mut binder)? else { |
There was a problem hiding this comment.
Maybe I'm just stupid - but I think both "lowering" and "bindings" are worth defining and explaining somewhere.
|
I think I mostly have a lot of questions 😅 |
Port file pruning to session stats rewrites