-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generalized ufuncs that return arrays are broken in a variety of cases #14811
Comments
sounds reasonable - have you already looked into raising a PR for this? |
@MarcoGorelli no, I have two PRs on ufuncs already (docs + sketch of partial fix that requires design feedback) and want to get through those first. |
And now I see there's a duplicate PR that got opened in interim :/ |
I can do this first if you think it's a prerequisite for docs improvements. |
I just encountered some odd behaviour which may fall under this category. import polars as pl
from numba import guvectorize, int64
@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
for i in range(len(col)):
print("[DEBUG]:", col[i])
result[i] = col[i] df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()
df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all())) >>> df
shape: (5, 2)
┌───────┬──────┐
│ index ┆ a │
│ --- ┆ --- │
│ u32 ┆ i64 │
╞═══════╪══════╡
│ 0 ┆ 12 │
│ 1 ┆ 13 │
│ 2 ┆ 14 │
│ null ┆ null │
│ null ┆ null │
└───────┴──────┘ Inside the ufunc, we seem to have access to the original "state": >>> out = df.with_columns(b = pl.col("a").map_batches(gufunc_example))
[DEBUG]: 12
[DEBUG]: 13
[DEBUG]: 14
[DEBUG]: 15 # <- ???
[DEBUG]: 16 # <- ??? |
That's a characteristic of how when/then works, unrelated to ufuncs. Anytime you do a when/then, it gives all the data to every then/otherwise and the when condition just chooses which result goes where. |
@deanm0000 But the when/then was used in a previous step to modify the dataframe - so the values are gone. df = df.with_columns(pl.when(pl.col("index") < 3).then(pl.all())) |
oh whoops I guess I should read more carefully. Arrow memory stores null state as a single bit separate from the data itself while leaving any arbitrary value in the value space. In this case we started with a good value and made it null so it doesn't do anything with the value itself. Since it's meant to be arbitrary it just changes the null state. Here's a different example:
There are warnings in the docs about numpy handling missing data differently and this is a manifestation of that. |
Yeah, I think I can now understand the problem - thanks. So if you start with missing data, you get import polars as pl
from numba import guvectorize, int64
@guvectorize([(int64[:], int64[:])], "(n)->(n)")
def gufunc_example(col, result):
for i in range(len(col)):
result[i] = col[2]
df = pl.DataFrame({"a": [12, 13, None, 15, 16]}).with_row_index()
df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a ┆ b │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ i64 ┆ i64 │
# ╞═══════╪══════╪══════╡
# │ 0 ┆ 12 ┆ 0 │
# │ 1 ┆ 13 ┆ 0 │
# │ 2 ┆ null ┆ null │
# │ 3 ┆ 15 ┆ 0 │
# │ 4 ┆ 16 ┆ 0 │
# └───────┴──────┴──────┘ But if missing data is introduced via "expressions", you still get the original value: df = pl.DataFrame({"a": [12, 13, 14, 15, 16]}).with_row_index()
df = df.with_columns(pl.when(pl.col("index") != 2).then(pl.all()))
df.with_columns(b = pl.col("a").map_batches(gufunc_example))
# shape: (5, 3)
# ┌───────┬──────┬──────┐
# │ index ┆ a ┆ b │
# │ --- ┆ --- ┆ --- │
# │ u32 ┆ i64 ┆ i64 │
# ╞═══════╪══════╪══════╡
# │ 0 ┆ 12 ┆ 14 │
# │ 1 ┆ 13 ┆ 14 │
# │ null ┆ null ┆ null │
# │ 3 ┆ 15 ┆ 14 │
# │ 4 ┆ 16 ┆ 14 │
# └───────┴──────┴──────┘ |
I will try to do this soon, since it really is an easy pitfall to hit and blocking an already-written docs PR. |
OK, starting work on this. |
Per @MarcoGorelli's request, going to restrict this just to the first case, missing data. The other issue is here: #16193 |
This is a follow-up to #14748.
Checks
Reproducible example
Log output
Issue description
The current
ufunc
support essentially assumes justufunc
, and can't handle the variety of edge cases generalizedufunc
s add:ufunc
s. I suspect there is an edge case where different output size can result in memory corruption...There are also potentially edge cases in the signature syntax that can be rejected early rather than late, with a more informative error message, e.g. multi-dimensional output maybe.
Expected behavior
This is an open question. Here is what I would expect:
ValueError("There is missing data in in your Series, which can result in generalized ufuncs giving wrong results")
.Series
of the correct size. The reason identical input and output size is assumed right now is (a) because it's always true for normalufunc
and (b) to deal with missing data (c) it's a nice optimization. However, the code could be changed to allow theufunc
to allocate the output array, and then that data copied into aSeries
.Note that that both these proposals are for generalized
ufunc
only. Regularufunc
would continue as normal.ufunc
ufunc
runs on individual values anywayufunc
Installed versions
The text was updated successfully, but these errors were encountered: