
BUG: np.asarray return a copy with shared memory #24478

Open
Itayazolay opened this issue Aug 21, 2023 · 15 comments

@Itayazolay

Describe the issue:

According to the np.asarray(a) docs, it may return either the original array or a copy.
In the case below, np.asarray(a, dtype='object') is not a is True, yet np.shares_memory(np.asarray(a, dtype='object'), a) is also True: a new array object is returned even though it shares memory with the original.

Reproduce the code example:

import numpy as np
import pickle
t = np.array([None, None, 123], dtype='object')
tp = pickle.loads(pickle.dumps(t))
g = np.asarray(tp, dtype='object')

assert g is not tp
assert np.shares_memory(g, tp)  # Unexpected
g[0] = '1'
assert tp[0] == '1'  # Unexpected


# The issue seems to originate here: pickling preserves dtype equality
# but not dtype identity.
assert pickle.loads(pickle.dumps(np.dtype('O'))) is not np.dtype('O')
assert pickle.loads(pickle.dumps(np.dtype('O'))) == np.dtype('O')

Error message:

No response

Runtime information:

1.25.2
3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0]
[{'numpy_version': '1.25.2',
'python': '3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) '
'[GCC 11.3.0]',
'uname': uname_result(system='Linux', node='itay-jether', release='6.2.0-26-generic', version='#26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2',
'AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL'],
'not_found': ['AVX512_KNL', 'AVX512_KNM']}},
{'architecture': 'SkylakeX',
'filepath': '/home/itay/miniforge3/envs/jpy310/lib/python3.10/site-packages/numpy.libs/libopenblas64_p-r0-5007b62f.3.23.dev.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23.dev'},
{'architecture': 'SkylakeX',
'filepath': '/home/itay/miniforge3/envs/jpy310/lib/python3.10/site-packages/scipy.libs/libopenblasp-r0-41284840.3.18.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.18'},
{'architecture': 'SkylakeX',
'filepath': '/home/itay/miniforge3/envs/jpy310/lib/python3.10/site-packages/cvxopt.libs/libopenblasp-r0-5c2b7639.3.23.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.23'},
{'architecture': 'Prescott',
'filepath': '/home/itay/miniforge3/envs/jpy310/lib/python3.10/site-packages/scs.libs/libopenblas-r0-f650aae0.3.3.so',
'internal_api': 'openblas',
'num_threads': 1,
'prefix': 'libopenblas',
'threading_layer': 'disabled',
'user_api': 'blas',
'version': None},
{'filepath': '/home/itay/miniforge3/envs/jpy310/lib/python3.10/site-packages/scikit_learn.libs/libgomp-a34b3233.so.1.0.0',
'internal_api': 'openmp',
'num_threads': 8,
'prefix': 'libgomp',
'user_api': 'openmp',
'version': None}]

Context for the issue:

I'm using pandas with pickled dataframes; when calling df.astype(str), the dataframe is changed in place.
See the corresponding issue in pandas.

@rkern
Member

rkern commented Aug 21, 2023

FWIW, in 1.23.5, g is tp as expected. Looking at git blame, I wonder if #23404 might be responsible. Another difference is the result of g.dtype is tp.dtype in each version: in 1.23.5 it is True, but in 1.25.1 it is False; that is, in 1.25.1, round-tripping an object array through pickle doesn't preserve the identity of the dtype object.

That said, the examples suggest a little more than the rest of the docstring claims. The docstring doesn't claim that the given object is returned, though this is often the case; it only claims that no copy (implicitly, of the data) is performed if none is needed given the other arguments. And that's what happens here: a new ndarray instance is created, but it points to the same data. It seems pandas is testing whether two arrays share data by testing the identity of the ndarray objects themselves, and that has never been a valid test.
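The distinction described above can be shown with a plain view, no pickle involved. This is just an illustrative sketch: object identity and data sharing are independent questions, and np.shares_memory is the reliable test for the latter.

```python
import numpy as np

a = np.arange(4)
v = a[:]  # slicing creates a new ndarray object viewing the same buffer

assert v is not a              # distinct Python objects...
assert np.shares_memory(v, a)  # ...but no data was copied

b = np.asarray(a, dtype=float)  # dtype conversion forces a real copy
assert not np.shares_memory(b, a)
```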

@rkern
Member

rkern commented Aug 21, 2023

I think it's worth adjusting our examples, at least, to not suggest that g is tp is a valid way to test if the data is copied, and it might be worth clarifying our statement about "no copy" in the body of the docstring to specifically call out that it will not copy the data, but might create a new array object.

@ngoldbaum
Member

I think it probably makes sense for us to try to make the pickle behavior more sensible. I would naively expect np.dtype("object") to be a singleton, and there seems to be code relying on that. Let's see if we can bisect this across a build system change…

@ngoldbaum
Member

ngoldbaum commented Aug 21, 2023

I bisected this to #21995. I suspect that pickle is going through the code path that calls NewFromDescrAndBase. Maybe we could check for a singleton first?

@ngoldbaum
Member

Ah, no, not quite right. The behavior of getting back a dtype object that is not the same as the object dtype singleton when unpickling an object array is older; #21995 just made this older behavior lead to a new array object being created.

@jacobmcasey
Contributor

Hey,

I'll be diving into this np.asarray pickle issue tomorrow.

Looks like the crux might be with how np.dtype('O') is pickled/unpickled and its interaction with np.asarray.

I'll start by isolating the problem in the code and running some tests, see what's going on under the hood.

Any quick pointers before I jump in?

Cheers,
Jacob

@ngoldbaum
Member

ngoldbaum commented Aug 27, 2023

You'll want to look at how dtype objects are pickled and unpickled. The dtype object (what NumPy internally calls a descriptor) is defined in numpy/core/src/multiarray/descriptor.c. The pickling operation is handled by arraydescr_reduce and arraydescr_setstate. Looking at the setstate implementation, unpickling currently always builds a new dtype object from the contents of the pickle. It may be possible to detect when the pickled dtype is exactly the same as a built-in singleton dtype (like object) and reuse the singleton instead of the partially reconstructed dtype; I don't know offhand whether that's tricky to implement.
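The singleton-reuse idea could be prototyped in pure Python before touching the C implementation. This is only a sketch under the assumption that matching a built-in singleton by type character plus equality is sufficient; canonicalize_dtype is a hypothetical helper, not an existing NumPy function.

```python
import numpy as np
import pickle

def canonicalize_dtype(dt):
    # Hypothetical helper: if the reconstructed dtype is exactly equal to
    # the built-in singleton with the same type character, return the
    # singleton so identity-based checks keep working after unpickling.
    singleton = np.dtype(dt.char)
    if dt == singleton and dt.metadata is None:
        return singleton
    return dt

rt = canonicalize_dtype(pickle.loads(pickle.dumps(np.dtype('O'))))
assert rt is np.dtype('O')  # identity restored
```

Structured or parameterized dtypes fall through the equality check and are returned unchanged, so only exact singleton matches are substituted.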

@jacobmcasey
Contributor

Thanks. I've been doing a bit of digging. When unpickling a NumPy array, is there a mechanism in place to ensure that if the array's dtype corresponds to one of the built-in singleton dtype objects, the singleton is reused instead of a new instance being created? I'm concerned about the potential memory and performance overhead of redundant dtype objects in memory, as that is likely where the issue comes from.

If not, here are a few ideas that might work:

- Object signature matching: compare the signature of the unpickled dtype with the built-in singleton dtypes and reuse a singleton if a match is found.
- Global dtype cache: maintain a cache of frequently used dtypes and consult it during unpickling.
- Pickle metadata annotation: modify the pickling process to include specific metadata for dtypes, which could then guide unpickling.
- Enhancing the reduce and setstate methods: modifying these methods for dtype objects might allow better reuse of built-in dtypes during unpickling. Given that this is a TODO in numpy/core/src/multiarray/descriptor.c, it could be the first attempt.

But it would be great to get some wider opinions on the best strategy before I try any of them.

Best,
Jacob

@rkern
Member

rkern commented Sep 14, 2023

Honestly, I think the best strategy is to fix the examples in the docstring. No one should be using is to test whether or not asarray() had to make a copy of the data. It's never guaranteed. Expending a lot of effort to get this one case fixed is just obscuring the other cases that will inevitably arise. It still won't enable anyone to write code using is to do that test.

@ngoldbaum
Member

I agree about the is test for ndarray comparison. I think doing is checks for dtype comparisons is slightly more defensible, but as this issue points out there are corner cases and it's probably not safe either. I wouldn't object to a change to the dtype pickling code so that singletons get returned in sensible contexts but I also don't think it's worth spending a lot of time on if your ultimate goal is making is comparisons work more sensibly.

@jacobmcasey
Contributor

Would a note like the following suffice, or do we need more complex examples? "Note: the use of the is operator in the examples above is for illustrative purposes only. In practice, whether np.asarray() copies is not guaranteed and can depend on various factors; one should therefore avoid using is to determine whether np.asarray() made a copy of the data."

@rkern
Member

rkern commented Sep 15, 2023

It's more informative and less verbose to change the examples themselves to show the right thing. The right way to do these checks is not more complicated.

@andyfaff
Contributor

I'd like to point out that the array API standard specifies a copy keyword which should guarantee a copy. Is numpy conformant to that specification?

@rkern
Member

rkern commented Sep 15, 2023

No, but that's not relevant to this reported issue.
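For completeness: even without array API conformance, a guaranteed copy has long been available through np.array, whose copy parameter defaults to True (np.asarray itself gained a copy keyword in later NumPy versions as part of array API adoption). A minimal sketch using the thread's own reproduction data:

```python
import numpy as np

t = np.array([None, None, 123], dtype=object)
g = np.array(t, dtype=object, copy=True)  # copy=True guarantees fresh data

assert g is not t
assert not np.shares_memory(g, t)
g[0] = '1'
assert t[0] is None  # the original is untouched
```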

@jacobmcasey
Contributor

Based on our discussion here, I've made changes to the asarray docstring and submitted a PR addressing this issue: #24714.

Any feedback or further suggestions would be appreciated!

Best,
Jacob
