if_else bug when using chunked string array with offsets #41479

Open
0x26res opened this issue May 1, 2024 · 0 comments

0x26res commented May 1, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I have a chunked array made of views/slices of the same array.

When I call if_else on that array, the results are wrong and can contain strings that are not valid UTF-8.

import random

import pyarrow as pa
import pyarrow.compute as pc

sizes = [131072, 57066]
values = ['FOO', "BAR", "HELLO", "WOLRD", ""]

data = pa.array([random.choice(values) for _ in range(sum(sizes))])

# Two overlapping slices of the same array; the second one has a non-zero offset.
inputs = pa.chunked_array(
    [
        data[:sizes[0]],
        data[sizes[1]:],
    ]
)

results = pc.if_else(
    pc.equal(inputs, ""),
    pa.scalar(None, pa.string()),
    inputs,
)

print(pc.unique(results).sort().to_pylist())
# This returns corrupted data, e.g.: ['\x00\x00\x00', '\x00\x00\x00\x00\x00', 'BAR', ...]
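
To make the layout of the two chunks explicit, here is a small sketch in plain Python (no pyarrow calls) that works out where each slice sits inside the parent array. It assumes ordinary Python slice semantics for `data[a:b]`; the dictionary keys `offset` and `length` are illustrative names chosen for this sketch only.

```python
# Sizes taken from the reproduction above.
sizes = [131072, 57066]
total = sum(sizes)  # length of the parent `data` array

# data[:sizes[0]] starts at element 0 of the parent array.
first = {"offset": 0, "length": sizes[0]}
# data[sizes[1]:] starts at element sizes[1] and runs to the end.
second = {"offset": sizes[1], "length": total - sizes[1]}

# Number of parent elements covered by both slices.
overlap = first["offset"] + first["length"] - second["offset"]

print(first)
print(second)
print(overlap)
```

Under these assumptions both chunks have the same length, only the second has a non-zero offset, and the two slices overlap in the parent array.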

For context, I'm loading data from a parquet file, and replacing empty strings with nulls. This started happening when the size of the parquet file increased and data was chunked.
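
For reference, the result I expect can be stated without pyarrow. The sketch below is a plain-Python stand-in for the replacement, not the Arrow kernel; `replace_empty_with_none` is a hypothetical helper name used only for illustration.

```python
values = ["FOO", "BAR", "HELLO", "WOLRD", ""]


def replace_empty_with_none(items):
    """Return a copy of `items` with empty strings replaced by None."""
    return [None if item == "" else item for item in items]


# Distinct values that should be present after the replacement.
expected_distinct = set(replace_empty_with_none(values))
print(expected_distinct)
```

Any distinct value outside this set, such as a string of `\x00` bytes, indicates corrupted output.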

I've tested with pyarrow==16.0.0

Component(s)

C++, Python
