[Bug]: reading NaT/NaN on M1 ARM chip #6191

Closed
philippemiron opened this issue Jan 25, 2022 · 8 comments · Fixed by #7827

@philippemiron

philippemiron commented Jan 25, 2022

What happened?

I have NaN values in a date vector stored in a netCDF file. When I read the file on my ARM Apple computer with xr.open_dataset(), the NaN values are not properly recognized.

For example, the following data is stored in a NetCDF:

date = pd.date_range(...)
date[4] = nan

Then when I read the file:
date[4] is set to date[0], which is the first date of the range instead of a 'NaT'.

I understand that this issue is quite odd, and it doesn't seem to happen on other platforms. I tried on macOS (with an Intel processor) and on two different Linux computers, and in those configurations date[4] is properly set to 'NaT' after opening the netCDF file with xr.open_dataset(). I tried both the same version of xarray and different versions, and I can't reproduce this issue on any machine except the one with the M1 ARM chip.

What did you expect to happen?

I expect the following result after running the minimal example:

array(['2022-01-01T00:00:00.000000000', '2022-01-02T00:00:00.000000000',
       '2022-01-03T00:00:00.000000000', '2022-01-04T00:00:00.000000000',
                                 'NaT', '2022-01-06T00:00:00.000000000',
       '2022-01-07T00:00:00.000000000', '2022-01-08T00:00:00.000000000',
       '2022-01-09T00:00:00.000000000', '2022-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')

Minimal Complete Verifiable Example

import xarray as xr
import pandas as pd
import numpy as np

time = pd.date_range(start="2022-01-01",end="2022-01-10").to_pydatetime()
time[4] = np.datetime64("NaT")

ds = xr.Dataset(
    data_vars=dict(
        time=(["nt"], time),
    ),
)
ds.to_netcdf('test.nc')

ds_r = xr.open_dataset('test.nc')
ds_r.time

Relevant log output

array(['2022-01-01T00:00:00.000000000', '2022-01-02T00:00:00.000000000',
       '2022-01-03T00:00:00.000000000', '2022-01-04T00:00:00.000000000',
       '2022-01-01T00:00:00.000000000', '2022-01-06T00:00:00.000000000',
       '2022-01-07T00:00:00.000000000', '2022-01-08T00:00:00.000000000',
       '2022-01-09T00:00:00.000000000', '2022-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.1 | packaged by conda-forge | (main, Dec 22 2021, 01:38:36) [Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 21.2.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.2
pandas: 1.3.5
numpy: 1.21.5
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.12.0
distributed: 2021.12.0
matplotlib: 3.5.1
cartopy: 0.20.1
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: None
sparse: None
setuptools: 60.0.4
pip: 21.3.1
conda: None
pytest: None
IPython: 8.0.0
sphinx: None

@philippemiron philippemiron added bug needs triage Issue that has not been reviewed by xarray team member labels Jan 25, 2022
@philippemiron philippemiron changed the title [Bug]: NaT/NaN not recognize on M1 computer [Bug]: reading NaT/NaN on M1 ARM chip Jan 25, 2022
@max-sixty
Collaborator

Thanks @philippemiron .

My guess is that this is an issue with an underlying library, since xarray doesn't generally do these operations in its code. Do you know if there are any similar issues in libnetcdf or netCDF4?

(Others know more than me about these libraries, so please feel free to interject)

@philippemiron
Author

So far I haven't spotted any other issues with libnetcdf.

@max-sixty
Collaborator

I tried reproducing on an M1 Mac, but my install of python seems to report that it's on an x86_64 (version='Darwin Kernel Version 21.0.1: Tue Sep 14 20:56:24 PDT 2021 ; root:xnu-8019.30.61~4/RELEASE_ARM64_T6000', machine='x86_64'). It didn't reproduce, unsurprisingly.
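
A quick way to check for the architecture mismatch described above is to ask the interpreter itself. This is only a sketch using the standard library; `interpreter_arch` is a hypothetical helper name, not something from xarray. On an Apple Silicon machine, a Python build running under translation would be expected to report `x86_64` here while the kernel version string still mentions ARM64, as in the output quoted above.

```python
import platform


def interpreter_arch() -> str:
    """Return the machine type reported to the running Python process."""
    return platform.machine()


arch = interpreter_arch()
print("interpreter architecture:", arch)
# The kernel version string can still name the physical CPU family,
# even when the interpreter itself reports a different architecture.
print("kernel version string:", platform.uname().version)
```

If the first line prints `arm64` (or `aarch64` on Linux), the interpreter is running natively and the reproduction should be meaningful.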

Does uninstalling netCDF4 help? That would isolate it to that library and its dependencies.

@philippemiron
Author

philippemiron commented Jan 26, 2022

I'm actually using miniforge, which natively supports ARM64. Uninstalling netCDF4 does not fix the issue. In fact, opening the same file as follows:

from netCDF4 import Dataset
f = Dataset('test.nc')
f['time'][:]

gives the expected result (the dates are not decoded, but the NaN is there as a masked value):

masked_array(data=[0.0, 1.0, 2.0, 3.0, --, 5.0, 6.0, 7.0, 8.0, 9.0],
             mask=[False, False, False, False,  True, False, False, False,
                   False, False],
       fill_value=nan)

@max-sixty
Collaborator

I sorted out my M1 python installation and can reproduce:

In [21]: ds_r.time
Out[21]:
<xarray.DataArray 'time' (nt: 10)>
array(['2022-01-01T00:00:00.000000000', '2022-01-02T00:00:00.000000000',
       '2022-01-03T00:00:00.000000000', '2022-01-04T00:00:00.000000000',
       '2022-01-01T00:00:00.000000000', '2022-01-06T00:00:00.000000000',  # Note the first value on this line!
       '2022-01-07T00:00:00.000000000', '2022-01-08T00:00:00.000000000',
       '2022-01-09T00:00:00.000000000', '2022-01-10T00:00:00.000000000'],
      dtype='datetime64[ns]')
Dimensions without coordinates: nt

It's quite surprising we get '2022-01-01T00:00:00.000000000' rather than NaT — why the beginning of the year?!

I suspect it's not directly an xarray issue, given that xarray is only Python code, and Python code does not directly branch by CPU.
I've frequently had issues like this where it's difficult to understand which library is responsible, so I'd welcome any further investigation here.

@max-sixty max-sixty added needs review upstream issue and removed needs triage Issue that has not been reviewed by xarray team member bug labels May 21, 2022
@philippemiron
Author

philippemiron commented May 23, 2022

It is replaced by the first value of the array. If you change to:

time = pd.date_range(start="2022-01-02",end="2022-01-11").to_pydatetime()

the NaT is replaced by '2022-01-02T00:00:00.000000000'. Maybe it is stored internally as a time_origin and some time_delta, and the NaT are replaced by 0?
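
That hypothesis can be sketched with NumPy alone. The day offsets below mirror what netCDF4 shows for the file (0.0, 1.0, ..., with a missing value at index 4). The NaN-to-0 replacement is done explicitly here and is an assumption standing in for whatever the affected platform does in the real decoding path; it is not xarray's actual code.

```python
import numpy as np

# Reference date ("time origin") and float day offsets, as stored on disk.
epoch = np.datetime64("2022-01-01", "ns")
offsets_days = np.array([0.0, 1.0, 2.0, 3.0, np.nan, 5.0])

# ASSUMPTION: simulate a platform where the missing offset ends up as 0.
# The replacement is explicit so the result is the same on every machine.
offsets_int = np.where(np.isnan(offsets_days), 0.0, offsets_days).astype(np.int64)

# epoch + 0 days is the epoch itself, so the missing entry silently
# turns into the first date of the range.
decoded = epoch + offsets_int * np.timedelta64(1, "D")
print(decoded[4])
```

With a zero offset, the missing entry decodes to the reference date, which matches the observation that changing the start of the range changes the substituted value.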

@DocOtak
Contributor

DocOtak commented Aug 9, 2022

I got caught by this one yesterday on an M1 machine. I did some digging and found what I think to be the underlying issue. The short explanation is that the time conversion functions do an astype(np.int64) or equivalent cast on arrays that contain nans. This is undefined behavior and very soon, doing this will start to emit RuntimeWarnings.

I knew from my own data files that the substituted value wasn't the first element of the array but the epoch given in the units. I started poking at the xarray internals (and the cftime internals) to get a minimal example working, and eventually found the following:

On an M1:

>>> from xarray.coding.times import _decode_datetime_with_pandas
>>> import numpy as np
>>> _decode_datetime_with_pandas(np.array([20000, float('nan')]),  "days since 1950-01-01", "proleptic_gregorian")
array(['2004-10-04T00:00:00.000000000', '1950-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
>>> np.array(np.nan).astype(np.int64)
array(0)

On an x86_64:

>>> from xarray.coding.times import _decode_datetime_with_pandas
>>> import numpy as np
>>> _decode_datetime_with_pandas(np.array([20000, float('nan')]),  "days since 1950-01-01", "proleptic_gregorian")
array(['2004-10-04T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
>>> np.array(np.nan).astype(np.int64)
array(-9223372036854775808)

This issue is not specific to Apple, the M1, or clang. I tested on an AWS Graviton (ARM) instance and got the same results with Ubuntu and GCC:

Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from xarray.coding.times import _decode_datetime_with_pandas
>>> import numpy as np
>>> _decode_datetime_with_pandas(np.array([20000, float('nan')]),  "days since 1950-01-01", "proleptic_gregorian")
array(['2004-10-04T00:00:00.000000000', '1950-01-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
>>> np.array(np.nan).astype(np.int64)
array(0)
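
As background for why the x86_64 result comes out as NaT at all: NumPy represents NaT as the minimum int64 value, which is exactly what the cast happens to produce there. The sketch below only reinterprets integers as datetimes and does not perform the NaN cast itself, so it should behave the same on any platform.

```python
import numpy as np

INT64_MIN = np.iinfo(np.int64).min

# NaT is stored as the smallest int64 value.
nat_as_int = np.array(["NaT"], dtype="datetime64[ns]").view(np.int64)[0]
print(nat_as_int == INT64_MIN)

# Reinterpreting raw integers as nanoseconds since the Unix epoch:
# INT64_MIN reads back as NaT, while 0 reads back as 1970-01-01.
reinterpreted = np.array([INT64_MIN, 0], dtype=np.int64).view("datetime64[ns]")
print(reinterpreted)
```

So on x86_64 the undefined cast lands on the NaT sentinel by coincidence, while a cast result of 0 decodes to a zero offset from the reference date.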

Here is where the cast is happening on the internal xarray implementation, CFtime has similar casts in its implementation.

flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
    np.int64
)

@DocOtak
Contributor

DocOtak commented Aug 9, 2022

Some additional info for figuring out the best way to address this.

For the decode using pandas approach, two things I tried worked: using a pandas.array with a nullable integer data type, or simulating what happens on x86_64 systems by checking for nans in the incoming array and setting those positions to numpy.iinfo(np.int64).min.

the pandas nullable integer array:

    # Note the capital "I" in "Int64": it selects the nullable integer type.
    flat_num_dates_ns_int = pd.array(flat_num_dates * _NS_PER_TIME_DELTA[delta], dtype="Int64")

simulate x86:

    flat_num_dates_ns_int = (flat_num_dates * _NS_PER_TIME_DELTA[delta]).astype(
        np.int64
    )

    flat_num_dates_ns_int[np.isnan(flat_num_dates)] = np.iinfo(np.int64).min

The pandas solution is explicitly marked experimental in their docs, and the emulation approach just feels "hacky" to me. Neither breaks any existing tests on my local machine.
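
For comparison, here is a standalone sketch of a third variant that avoids casting NaN at all: mask the missing positions first, cast only finite values, then write NaT back. It uses NumPy only. `decode_days_since` and `NS_PER_DAY` are hypothetical names for illustration (the latter stands in for the days entry of `_NS_PER_TIME_DELTA`), and this is not the fix that was eventually merged.

```python
import numpy as np

NS_PER_DAY = 86_400_000_000_000  # hypothetical stand-in for the "days" factor


def decode_days_since(num_dates, epoch):
    """Decode float day offsets from `epoch`, mapping NaN to NaT explicitly."""
    num_dates = np.asarray(num_dates, dtype=np.float64)
    missing = np.isnan(num_dates)
    # Replace NaN with a finite placeholder so the int64 cast is well defined.
    safe = np.where(missing, 0.0, num_dates)
    offsets_ns = (safe * NS_PER_DAY).astype(np.int64)
    decoded = np.datetime64(epoch, "ns") + offsets_ns.astype("timedelta64[ns]")
    # Restore the missing positions as NaT instead of relying on the cast.
    decoded[missing] = np.datetime64("NaT")
    return decoded


result = decode_days_since([20000, float("nan")], "1950-01-01")
print(result)
```

The input matches the example earlier in this thread, so the first element should decode to 2004-10-04 and the second to NaT regardless of CPU architecture.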

cftime itself has no support for nan type missing values and will fail:

(on x86_64)

>>> import numpy as np
>>> from xarray.coding.times import decode_cf_datetime
>>> decode_cf_datetime(np.array([0, np.nan]), "days since 1950-01-01", use_cftime=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/abarna/.pyenv/versions/3.8.5/lib/python3.8/site-packages/xarray/coding/times.py", line 248, in decode_cf_datetime
    dates = _decode_datetime_with_cftime(flat_num_dates, units, calendar)
  File "/home/abarna/.pyenv/versions/3.8.5/lib/python3.8/site-packages/xarray/coding/times.py", line 164, in _decode_datetime_with_cftime
    cftime.num2date(num_dates, units, calendar, only_use_cftime_datetimes=True)
  File "src/cftime/_cftime.pyx", line 484, in cftime._cftime.num2date
TypeError: unsupported operand type(s) for +: 'cftime._cftime.DatetimeGregorian' and 'NoneType'

cftime is happy with masked arrays:

>>> import cftime
>>> a1 = np.ma.masked_invalid(np.array([0, np.nan]))
>>> cftime.num2date(a1, "days since 1950-01-01")
masked_array(data=[cftime.DatetimeGregorian(1950, 1, 1, 0, 0, 0, 0), --],
             mask=[False,  True],
       fill_value='?',
            dtype=object)

3 participants