Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Don't require data to be sorted by by column in rolling_*_by operations #16249

Merged
merged 3 commits into from
May 20, 2024

Conversation

MarcoGorelli
Copy link
Collaborator

@MarcoGorelli MarcoGorelli commented May 15, 2024

Needs rebasing onto #16247

In summary this splits rolling_apply_agg_window into:

  • rolling_apply_agg_window_sorted (no difference to the current rolling_apply_agg_window, this is the fastpath when data is already known to be sorted by by)
  • rolling_apply_agg_window: doesn't require the data to be sorted by by, which is why it needs an extra argument (sorting_indices) which is used to place the results in the right place of the out Vec
  • deprecated warn_if_unsorted on the Python side, removed it from the Rust side

The gist of the change is rolling_apply_agg_window in crates/polars-time/src/chunkedarray/rolling_window/rolling_kernels/no_nulls.rs

Perf impact:

  • for data already known to be sorted: none
  • for data not known to be sorted, an extra arg_sort is performed in order to create sorting_indices. Once this has been created, it's just used to place the elements in the out Vec in the right place (rolling_apply_agg_window)

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 15, 2024
Copy link

codecov bot commented May 15, 2024

Codecov Report

Attention: Patch coverage is 86.84211% with 30 lines in your changes are missing coverage. Please review.

Project coverage is 80.78%. Comparing base (98a2d9b) to head (7326da2).
Report is 10 commits behind head on main.

Files Patch % Lines
py-polars/polars/expr/expr.py 12.50% 7 Missing and 7 partials ⚠️
...s-time/src/chunkedarray/rolling_window/dispatch.rs 80.39% 10 Missing ⚠️
...edarray/rolling_window/rolling_kernels/no_nulls.rs 96.27% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16249      +/-   ##
==========================================
- Coverage   80.82%   80.78%   -0.04%     
==========================================
  Files        1393     1393              
  Lines      179341   179517     +176     
  Branches     2918     2929      +11     
==========================================
+ Hits       144955   145031      +76     
- Misses      33884    33976      +92     
- Partials      502      510       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MarcoGorelli MarcoGorelli force-pushed the sort-in-rolling-by branch 2 times, most recently from 13522e6 to ce172cc Compare May 16, 2024 09:35
@MarcoGorelli MarcoGorelli force-pushed the sort-in-rolling-by branch 4 times, most recently from f0ae832 to 2021ca2 Compare May 16, 2024 11:29
// start with a dummy index, will be overwritten on first iteration.
let mut agg_window = Agg::new(values, 0, 0, params);

let mut out = zeroed_vec(values.len());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @orlp - I've tried using zeroed_vec here, and creating a validity mask separately. Any feedback on how I've done this would be appreciated if possible, thanks 🙏

@MarcoGorelli MarcoGorelli force-pushed the sort-in-rolling-by branch 2 times, most recently from e38bc8c to 23beb74 Compare May 16, 2024 21:43
let mut agg_window = Agg::new(values, 0, 0, params);

let mut out = zeroed_vec(values.len());
let mut null_positions = Vec::with_capacity(values.len());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this allocate this much?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, just pushing an update

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just chatted with Orson about this

The difficulty is that it's not known in advance (before starting the loop) whether there will be nulls
He's suggested making an Option<MutableBitmap>, which gets set on the first null

There is an extra if validity.is_none() check, but nulls produced by the rolling operation should be fairly rare anyway

I've pushed an update - thanks @orlp, much appreciated 🙏

@MarcoGorelli MarcoGorelli marked this pull request as draft May 17, 2024 08:52
@MarcoGorelli MarcoGorelli changed the title feat: Don't require data to be sorted by by column in rolling_*_by operations WIP feat: Don't require data to be sorted by by column in rolling_*_by operations May 17, 2024
@MarcoGorelli MarcoGorelli marked this pull request as ready for review May 17, 2024 09:17
@MarcoGorelli MarcoGorelli changed the title WIP feat: Don't require data to be sorted by by column in rolling_*_by operations feat: Don't require data to be sorted by by column in rolling_*_by operations May 17, 2024
@@ -99,25 +99,29 @@ where
// `by`, which has already been checked to be the same length as the values.
unsafe { *out.get_unchecked_mut(*out_idx as usize) = res };
} else {
null_positions.push(*out_idx as usize)
if validity.is_none() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this in a closure as that ensure we have a single code block and call that closure twice.

I think the overhead of the function call is not too much here as you said, you don't expect many nulls.

@ritchie46
Copy link
Member

Thanks for this one @MarcoGorelli. I do agree it is an improvement to answer the query. :)

@ritchie46 ritchie46 merged commit 7cd4e92 into pola-rs:main May 20, 2024
27 checks passed
@MarcoGorelli
Copy link
Collaborator Author

yay, thank you so much, and thanks @orlp for helping me - been wanting to see this happen for a while 🥳 I think users will definitely appreciate it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants