Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): Expose string expression nodes to python #16221

Merged
merged 5 commits into from
May 16, 2024

Conversation

brandon-b-miller
Copy link
Contributor

Hello! First of all, thanks for polars!

This PR follows up recent work to expose the logical plan to python by adding string expressions. This should allow a consumer of the plan on the python side to implement accelerated string computations.

cc @wence-

@github-actions github-actions bot added the enhancement New feature or an improvement of an existing feature label May 14, 2024
@brandon-b-miller brandon-b-miller changed the title feat(python) Expose string expression nodes to python feat(python): Expose string expression nodes to python May 14, 2024
@github-actions github-actions bot added the python Related to Python Polars label May 14, 2024
@ritchie46
Copy link
Member

Yeap, seems good. Left one comment to shrink the match.

StringFunction::ConcatHorizontal {
delimiter,
ignore_nulls,
} => ("concathorizontal", delimiter, ignore_nulls).to_object(py),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question I have here: we're effectively using a tuple with a string as a poor approximation of a tagged union type (AKA, the incoming enum type on the rust side).

This means that in Python it's not easily to introspect the set of values that we can receive here.

One alternative option (which is more typing) is to have an enum class here that is exposed. This could either be a unit variant enum that replaces the use of the string "tag" with an enum value:

#[pyclass]
enum PyStringFunction {
  ConcatHorizontal,
  ConcatVertical,
  ...
}
...

match strfun {
   StringFunction::ConcatHorizontal { delimiter, ignore_nulls } => (PyStringFunction, delimiter, ignore_nulls).to_object(py),
   ...
}

Or, I think, using the new features of pyo3 for struct variant enums:

#[pyclass]
enum PyStringFunction {
   ConcatHorizontal { delimiter: Whatever, ignore_nulls: bool },
   ...
}

...
match strfn {
   StringFunction::ConcatHorizontal { d, i } => PyStringFunction::ConcatHorizontal {d, i},
   ...
}

AFAICT, from some (very minimal) microbenchmarking, this all have pretty much the same performance characteristics.

One niggle for the last option is that (for now) you can't mix unit an struct enum variants, and tuple enum variants aren't supported at all, so you can't write:

#[pyclass]
enum SomeEnum {
   Unit,
   Tuple(i32),
   Struct { a: i32, b: i64 },
}

Instead one needs to map every to (possibly empty) structs and come up with a name for the unnamed tuple variant field.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like that more as we can more easily change our internals whilst maintaining the same exposed enums. It is some extra typing atm, but it gives us an extra indirection that will likely make it more maintainable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added something in da70f23 that hopefully moves us in this direction 👍

infer_schema_len,
} => (
"jsondecode",
Wrap(dtype.as_ref().unwrap().clone()).to_object(py),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (for @ritchie46): are these dtypes that live inside the variants different from the ones that AExpr.to_field(...) would return?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should always trust AExpr::to_field. That this variant holds a dtype is an implementation detail. Those details are now leaked and might be changed. So be aware of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could therefore just not leak these out: it makes the mapping not one-to-one, but if to_field is the correct source of truth then it makes sense to not lead someone down a bad path by providing these slots in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree.

Copy link

codecov bot commented May 14, 2024

Codecov Report

Attention: Patch coverage is 1.21951% with 81 lines in your changes are missing coverage. Please review.

Project coverage is 80.82%. Comparing base (f6b4f48) to head (e998135).
Report is 21 commits behind head on main.

Files Patch % Lines
py-polars/src/lazyframe/visitor/expr_nodes.rs 0.00% 81 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16221      +/-   ##
==========================================
- Coverage   80.99%   80.82%   -0.18%     
==========================================
  Files        1392     1394       +2     
  Lines      178930   179569     +639     
  Branches     2904     2913       +9     
==========================================
+ Hits       144920   145130     +210     
- Misses      33507    33936     +429     
  Partials      503      503              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46 ritchie46 merged commit 98a2d9b into pola-rs:main May 16, 2024
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants