Compute data buffer length by using start and end values in offset buffer #5756

Closed
viirya opened this issue May 11, 2024 · 0 comments · Fixed by #5741

viirya commented May 11, 2024

Describe the bug

I encountered an issue when importing an empty array with a variable-size binary layout (e.g., a string array) from Java Arrow.

Java Arrow and arrow-rs differ in how they compute the length of the data buffer: apache/arrow#41610 (comment)

This is how Java Arrow imports a Utf8 array:

try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
      final int start = offsets.getInt(0);
      final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
      final int len = end - start;
      ...
}

So even if the offset buffer is not initialized, for an empty array whose offset buffer has a single element, end - start is always 0, which is the correct length of the data buffer. That is why the added roundtrip tests pass.

arrow-rs, however, takes the last value of the offset buffer, i.e., end, as the length of the data buffer. If that value is not initialized to zero, the computed length of the data buffer is incorrect.
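
The difference can be sketched in plain Rust without any arrow-rs APIs. Both function names below are hypothetical and exist only for illustration; `offsets` stands in for the imported 32-bit offset buffer, which is assumed to hold `len + 1` entries for an array of `len` elements.

```rust
/// Hypothetical helper: data buffer length taken from the last offset only,
/// which is the behaviour described above for arrow-rs.
fn data_len_from_last(offsets: &[i32], len: usize) -> i32 {
    offsets[len]
}

/// Hypothetical helper: data buffer length computed as `end - start`,
/// mirroring the Java Arrow import code quoted above.
fn data_len_from_range(offsets: &[i32], len: usize) -> i32 {
    let start = offsets[0];
    let end = offsets[len];
    end - start
}
```

For an empty array whose single offset slot holds an arbitrary value such as 7, `data_len_from_last(&[7], 0)` returns 7, while `data_len_from_range(&[7], 0)` returns 0.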

This is what the spec says about the first offset value:

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.

It looks like the first value doesn't have to be 0, although it generally is. So it seems Java Arrow's approach is (more) correct.
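
To illustrate offsets that do not start at 0, here is a minimal sketch of selecting the bytes that an offset buffer actually references. The function `referenced_values` is hypothetical and not part of arrow-rs; it assumes the offsets are non-negative, non-decreasing, and within the bounds of `values`.

```rust
/// Hypothetical helper: returns the slice of `values` covered by the offsets,
/// i.e. the range from the first offset to the last offset.
fn referenced_values<'a>(values: &'a [u8], offsets: &[i32]) -> &'a [u8] {
    match (offsets.first(), offsets.last()) {
        // The referenced region is `start..end`, so its length is `end - start`.
        (Some(&start), Some(&end)) => &values[start as usize..end as usize],
        // An offset buffer with no entries references no data.
        _ => &[],
    }
}
```

With `values = b"xxhelloworld"` and `offsets = [2, 7, 12]`, the two strings are `hello` and `world`, and the referenced region is 10 bytes long even though the last offset is 12.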

To Reproduce

Expected behavior

Additional context
