fix: prevent exponential memory growth in UnionArray #3119

jpivarski · 2024-05-14T22:15:44Z

I decided to just make UnionArray.simplified not allow lazy carries. It should always return correct results and it can only have a performance impact on wide RecordArrays that go through UnionArray.simplified (perhaps via ak.concatenate).

The new test would fail for the old code: for 5 loops, the NumpyArray length gets up to 32, and we assert <= 2.
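To make the numbers above concrete, here is a toy model of the growth pattern. It is not Awkward Array's code: the function `buffer_lengths`, the starting length of 1, and the "doubles on every loop" step are all assumptions, chosen only because they reproduce the quoted figure (2**5 == 32).

```python
# Toy model only: this is NOT Awkward Array's implementation.
# Assumption: the un-fixed code path doubles a buffer on every loop,
# which is consistent with the figure quoted above (2**5 == 32).


def buffer_lengths(loops, start=1, doubles=True):
    """Return the (hypothetical) buffer length observed after each loop."""
    lengths = []
    length = start
    for _ in range(loops):
        if doubles:
            length *= 2  # hypothetical growth step: nothing is trimmed
        lengths.append(length)
    return lengths


old = buffer_lengths(5, doubles=True)   # models the behavior before the fix
new = buffer_lengths(5, doubles=False)  # models the behavior after the fix

print(old)  # [2, 4, 8, 16, 32]
print(new)  # [1, 1, 1, 1, 1]

# The same kind of bound the new test uses (<= 2) separates the two cases.
assert max(new) <= 2
assert max(old) > 2
```

A fixed upper bound such as `<= 2` is a convenient regression check because it fails for any growth that depends on the number of loops, not only for exact doubling.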

I only had to change one test, which was hard-coded to a particular Form. It was for testing allow_noncanonical_form=True in ak.from_buffers. I think that feature is still being tested with the updated test.

@agoose77, am I missing anything here? Would you do anything differently?

ianna

@jpivarski - it looks like the tests on Mac are segfaulting and not only for this PR. I'm checking locally.

agoose77 · 2024-05-15T11:49:14Z

@jpivarski I agree with your reasoning. I haven't given much more thought to alternative solutions, but we can always revisit this in future!

ianna · 2024-05-15T16:18:38Z

> @jpivarski - it looks like the tests on Mac are segfaulting and not only for this PR. I'm checking locally.

I can confirm that the segfault reproduces locally on macOS 11.6 with pyarrow 16.1.0 (released yesterday). It happens in tests/test_0080_flatpandas_multiindex_rows_and_columns.py.

Changing back to pyarrow 16.0.0 fixes the issue.

jpivarski · 2024-05-15T22:18:46Z

Although we could hold back pyarrow<16.1.0, that would have to be temporary; we'd have to remove the constraint after pyarrow gets fixed. We can only expect pyarrow to get fixed if the error is reported to them, so I tried to do that, but ran into #3122.
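As an illustration of the temporary cap discussed here, the following sketch checks whether an installed pyarrow falls under `pyarrow<16.1.0`. The helper names (`parse_version`, `satisfies_cap`, `installed_pyarrow_ok`) are hypothetical, and the choice of 16.1.0 as the first affected release is taken from this thread rather than from pyarrow's own documentation. The version parsing is deliberately simplistic; `packaging.version` would be the more robust choice in real code.

```python
from importlib import metadata

# Assumption: 16.1.0 is the first affected release, as reported in this thread.
FIRST_BAD = (16, 1, 0)


def parse_version(text):
    """Turn '16.1.0' into (16, 1, 0), keeping only leading digits per field."""
    parts = []
    for field in text.split(".")[:3]:
        digits = ""
        for ch in field:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad short versions such as '17.0'
    return tuple(parts)


def satisfies_cap(version_text, first_bad=FIRST_BAD):
    """True if the version is strictly below the first affected release."""
    return parse_version(version_text) < first_bad


def installed_pyarrow_ok():
    """True/False for the installed pyarrow, or None if it is not installed."""
    try:
        return satisfies_cap(metadata.version("pyarrow"))
    except metadata.PackageNotFoundError:
        return None


print(satisfies_cap("16.0.0"))  # True
print(satisfies_cap("16.1.0"))  # False
```

Reading the version from package metadata avoids importing pyarrow itself, which matters when the import is the thing that crashes.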

ianna · 2024-05-16T10:54:35Z

> Although we could hold back pyarrow<16.1.0, that would have to be temporary; we'd have to remove the constraint after pyarrow gets fixed. We can only expect pyarrow to get fixed if the error is reported to them, so I tried to do that, but #3122.

I uninstalled both awkward and awkward-cpp and have a minimum reproducer of a segfault with:

% python
Python 3.11.5 (main, Sep 11 2023, 08:19:27) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
zsh: segmentation fault  python

The test_pyarrow.py has only one line:
import pyarrow

 % lldb python test_pyarrow.py 
(lldb) target create "python"
Current executable set to '/Users/yana/anaconda3/bin/python' (x86_64).
(lldb) settings set -- target.run-args  "test_pyarrow.py"
(lldb) run
Process 34536 launched: '/Users/yana/anaconda3/bin/python' (x86_64)
Process 34536 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x0000000129c8685e libarrow.1601.dylib`malloc_conf_init_helper + 302
libarrow.1601.dylib`malloc_conf_init_helper:
->  0x129c8685e <+302>: movzbl (%rbx), %eax
    0x129c86861 <+305>: testb  %al, %al
    0x129c86863 <+307>: je     0x129c8936e               ; <+11326>
    0x129c86869 <+313>: movq   %rbx, %rcx
(lldb)
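Because a segfault on import kills the interpreter that runs it, a reproducer like the one above is easier to script if the import happens in a child process. The sketch below is one way to do that using only the standard library; `probe_import` is a hypothetical helper, not part of Awkward Array or pyarrow, and the signal interpretation of negative return codes applies to POSIX systems only.

```python
import subprocess
import sys


def probe_import(module, timeout=60):
    """Import `module` in a child interpreter and classify the outcome.

    Only pass trusted module names: the name is placed into a -c command.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    code = result.returncode
    if code == 0:
        return "ok", code
    if code < 0:
        # On POSIX, a return code of -N means the child was killed by signal N
        # (a segmentation fault shows up this way).
        return "killed by signal", code
    # A positive code usually means an ordinary Python error, e.g. ImportError.
    return "python error", code


print(probe_import("json"))
```

Calling `probe_import("pyarrow")` in the affected environment would be expected to report the crash without taking down the calling process, so the same script can be rerun against different pyarrow versions.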

jpivarski · 2024-05-16T12:15:27Z

That reproducer must depend on the environment because it didn't segfault immediately on import for me. I also noticed in the output that this is x86-64. My laptop is ARM (M2). That's likely relevant.

Could you send as much information as you can to the Arrow developers? Since we also need a short-term solution, we can put a version cap on Arrow in our tests for now. There are a few pull requests that need it.

ianna · 2024-05-16T15:47:50Z

> That reproducer must depend on the environment because it didn't segfault immediately on import for me. I also noticed in the output that this is x86-64. My laptop is ARM (M2). That's likely relevant.
>
> Could you send as much information as you can to the Arrow developers? Since we also need a short-term solution, we can put a version cap on Arrow in our tests for now. There are a few pull requests that need it.

I've opened an issue apache/arrow#41696

jpivarski · 2024-05-16T17:27:45Z

Awesome! It looks like they're dealing with it right away.

fix: prevent exponential memory growth in UnionArray

698b015

jpivarski requested a review from agoose77 May 14, 2024 22:15

jpivarski linked an issue May 14, 2024 that may be closed by this pull request

Exponential memory growth in UnionArray broadcasting #3118

Closed

jpivarski temporarily deployed to docs May 14, 2024 22:28 with GitHub Actions (inactive)

ianna reviewed May 15, 2024

jpivarski mentioned this pull request May 15, 2024

Is awkward-cpp not linking properly in MacOS now? #3122

Closed

Merge branch 'main' into jpivarski/prevent_exponential_memory_growth_in_unionarray

78b741b

jpivarski deployed to docs May 28, 2024 14:00 with GitHub Actions

jpivarski merged commit 28a89da into main May 28, 2024
41 checks passed

jpivarski deleted the jpivarski/prevent_exponential_memory_growth_in_unionarray branch May 28, 2024 14:43
