Can't read or write h5ad files that contain boolean columns with nulls (None) #1258

Closed
2 of 3 tasks
jkanche opened this issue Dec 12, 2023 · 9 comments

Comments

@jkanche

jkanche commented Dec 12, 2023

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

Tested this on both anndata version 0.8.0 and the latest 0.10.3 releases.

Code:

import anndata
import numpy as np
import pandas as pd

print(anndata.__version__)
# '0.10.3'

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame({
        "test_bool_null": [True, False, None, False],
    }),
)

adata.write_h5ad("test.h5ad")
Traceback:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 adata.write_h5ad("test.h5ad")

File ~/.local/lib/python3.9/site-packages/anndata/_core/anndata.py:2008, in AnnData.write_h5ad(self, filename, compression, compression_opts, as_dense)
   2005 if filename is None:
   2006     filename = self.filename
-> 2008 write_h5ad(
   2009     Path(filename),
   2010     self,
   2011     compression=compression,
   2012     compression_opts=compression_opts,
   2013     as_dense=as_dense,
   2014 )
   2016 if self.isbacked:
   2017     self.file.filename = filename

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:103, in write_h5ad(filepath, adata, as_dense, dataset_kwargs, **kwargs)
    101 elif adata.raw is not None:
    102     write_elem(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 103 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    104 write_elem(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    105 write_elem(f, "obsm", dict(adata.obsm), dataset_kwargs=dataset_kwargs)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:368, in write_elem(store, k, elem, dataset_kwargs)
    344 def write_elem(
    345     store: GroupStorageType,
    346     k: str,
   (...)
    349     dataset_kwargs: Mapping = MappingProxyType({}),
    350 ) -> None:
    351     """
    352     Write an element to a storage group using anndata encoding.
    353 
   (...)
    366         E.g. for zarr this would be `chunks`, `compressor`.
    367     """
--> 368     Writer(_REGISTRY).write_elem(store, k, elem, dataset_kwargs=dataset_kwargs)

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:239, in report_write_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    237         break
    238 try:
--> 239     return func(*args, **kwargs)
    240 except Exception as e:
    241     add_key_note(e, elem, key, "writ")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:326, in Writer.write_elem(self, store, k, elem, dataset_kwargs, modifiers)
    317     return self.callback(
    318         write_func,
    319         store,
   (...)
    323         iospec=self.registry.get_spec(elem),
    324     )
    325 else:
--> 326     return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:54, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
     52 @wraps(func)
     53 def wrapper(g, k, *args, **kwargs):
---> 54     result = func(g, k, *args, **kwargs)
     55     g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     56     g[k].attrs.setdefault("encoding-version", spec.encoding_version)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:683, in write_dataframe(f, key, df, _writer, dataset_kwargs)
    678 _writer.write_elem(
    679     group, index_name, df.index._values, dataset_kwargs=dataset_kwargs
    680 )
    681 for colname, series in df.items():
    682     # TODO: this should write the "true" representation of the series (i.e. the underlying array or ndarray depending)
--> 683     _writer.write_elem(
    684         group, colname, series._values, dataset_kwargs=dataset_kwargs
    685     )

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:239, in report_write_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    237         break
    238 try:
--> 239     return func(*args, **kwargs)
    240 except Exception as e:
    241     add_key_note(e, elem, key, "writ")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:326, in Writer.write_elem(self, store, k, elem, dataset_kwargs, modifiers)
    317     return self.callback(
    318         write_func,
    319         store,
   (...)
    323         iospec=self.registry.get_spec(elem),
    324     )
    325 else:
--> 326     return write_func(store, k, elem, dataset_kwargs=dataset_kwargs)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:54, in write_spec.<locals>.decorator.<locals>.wrapper(g, k, *args, **kwargs)
     52 @wraps(func)
     53 def wrapper(g, k, *args, **kwargs):
---> 54     result = func(g, k, *args, **kwargs)
     55     g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     56     g[k].attrs.setdefault("encoding-version", spec.encoding_version)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:414, in write_vlen_string_array(f, k, elem, _writer, dataset_kwargs)
    412 """Write methods which underlying library handles nativley."""
    413 str_dtype = h5py.special_dtype(vlen=str)
--> 414 f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs)

File /apps/user/gpy/envs/dev/GPy39/lib/python3.9/site-packages/h5py/_hl/group.py:183, in Group.create_dataset(self, name, shape, dtype, data, **kwds)
    180         parent_path, name = name.rsplit(b'/', 1)
    181         group = self.require_group(parent_path)
--> 183 dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    184 dset = dataset.Dataset(dsid)
    185 return dset

File /apps/user/gpy/envs/dev/GPy39/lib/python3.9/site-packages/h5py/_hl/dataset.py:168, in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, dapl, efile_prefix, virtual_prefix, allow_unknown_filter, rdcc_nslots, rdcc_nbytes, rdcc_w0)
    165 dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl, dapl=dapl)
    167 if (data is not None) and (not isinstance(data, Empty)):
--> 168     dset_id.write(h5s.ALL, h5s.ALL, data)
    170 return dset_id

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5d.pyx:280, in h5py.h5d.DatasetID.write()

File h5py/_proxy.pyx:145, in h5py._proxy.dset_rw()

File h5py/_conv.pyx:444, in h5py._conv.str2vlen()

File h5py/_conv.pyx:95, in h5py._conv.generic_converter()

File h5py/_conv.pyx:249, in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings
Error raised while writing key 'test_bool_null' of <class 'h5py._hl.group.Group'> to /

Similar issue if the file already contains a boolean array with null values (written through the R interface)

anndata.read_h5ad(data_path, backed=True)
Traceback:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 anndata.read_h5ad("/../GSM4970067.h5ad", backed=True)

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:208, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    206         mode = "r+"
    207     assert mode in {"r", "r+"}
--> 208     return read_h5ad_backed(filename, mode)
    210 if as_sparse_fmt not in (sparse.csr_matrix, sparse.csc_matrix):
    211     raise NotImplementedError(
    212         "Dense formats can only be read to CSR or CSC matrices at this time."
    213     )

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:151, in read_h5ad_backed(filename, mode)
    148         if k in f:  # Backwards compat
    149             d[k] = read_dataframe(f[k])
--> 151 d.update({k: read_elem(f[k]) for k in attributes if k in f})
    153 d["raw"] = _read_raw(f, attrs={"var", "varm"})
    155 adata = AnnData(**d)

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:151, in <dictcomp>(.0)
    148         if k in f:  # Backwards compat
    149             d[k] = read_dataframe(f[k])
--> 151 d.update({k: read_elem(f[k]) for k in attributes if k in f})
    153 d["raw"] = _read_raw(f, attrs={"var", "varm"})
    155 adata = AnnData(**d)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:341, in read_elem(elem)
    329 def read_elem(elem: StorageType) -> Any:
    330     """
    331     Read an element from a store.
    332 
   (...)
    339         The stored element.
    340     """
--> 341     return Reader(_REGISTRY).read_elem(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
    249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
--> 251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in read_dataframe(elem, _reader)
    691 columns = list(_read_attr(elem.attrs, "column-order"))
    692 idx_key = _read_attr(elem.attrs, "_index")
    693 df = pd.DataFrame(
--> 694     {k: _reader.read_elem(elem[k]) for k in columns},
    695     index=_reader.read_elem(elem[idx_key]),
    696     columns=columns if len(columns) else None,
    697 )
    698 if idx_key != "_index":
    699     df.index.name = idx_key

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in <dictcomp>(.0)
    691 columns = list(_read_attr(elem.attrs, "column-order"))
    692 idx_key = _read_attr(elem.attrs, "_index")
    693 df = pd.DataFrame(
--> 694     {k: _reader.read_elem(elem[k]) for k in columns},
    695     index=_reader.read_elem(elem[idx_key]),
    696     columns=columns if len(columns) else None,
    697 )
    698 if idx_key != "_index":
    699     df.index.name = idx_key

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
    249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
--> 251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:852, in read_nullable_boolean(elem, _reader)
    848 @_REGISTRY.register_read(H5Group, IOSpec("nullable-boolean", "0.1.0"))
    849 @_REGISTRY.register_read(ZarrGroup, IOSpec("nullable-boolean", "0.1.0"))
    850 def read_nullable_boolean(elem, _reader):
    851     if "mask" in elem:
--> 852         return pd.arrays.BooleanArray(
    853             _reader.read_elem(elem["values"]), mask=_reader.read_elem(elem["mask"])
    854         )
    855     else:
    856         return pd.array(_reader.read_elem(elem["values"]))

File /GPy39/lib/python3.9/site-packages/pandas/core/arrays/boolean.py:299, in BooleanArray.__init__(self, values, mask, copy)
    295 def __init__(
    296     self, values: np.ndarray, mask: np.ndarray, copy: bool = False
    297 ) -> None:
    298     if not (isinstance(values, np.ndarray) and values.dtype == np.bool_):
--> 299         raise TypeError(
    300             "values should be boolean numpy array. Use "
    301             "the 'pd.array' function instead"
    302         )
    303     self._dtype = BooleanDtype()
    304     super().__init__(values, mask, copy=copy)

TypeError: values should be boolean numpy array. Use the 'pd.array' function instead
Error raised while reading key '/obs/AIRR_has_ir' of <class 'h5py._hl.group.Group'> to /

Versions

-----
anndata             0.10.3
session_info        1.0.0
-----
anyio                       NA
arrow                       1.3.0
asciitree                   NA
asttokens                   NA
attr                        23.1.0
attrs                       23.1.0
babel                       2.11.0
brotli                      NA
certifi                     2022.12.07
charset_normalizer          2.1.1
cloudpickle                 2.2.1
colorama                    0.4.4
comm                        0.1.3
cr                          NA
cython_runtime              NA
dask                        2023.2.0
dateutil                    2.8.2
debugpy                     1.6.7
decorator                   5.1.1
dill                        0.3.6
dot_parser                  NA
entrypoints                 0.4
exceptiongroup              1.1.1
executing                   1.2.0
fasteners                   0.17.3
fastjsonschema              NA
fqdn                        NA
gmpy2                       2.1.2
google                      NA
h5py                        3.8.0
idna                        3.4
importlib_metadata          NA
ipykernel                   6.27.1
ipython_genutils            0.2.0
isoduration                 NA
jedi                        0.18.2
jinja2                      3.1.2
json5                       NA
jsonpointer                 2.4
jsonschema                  4.20.0
jsonschema_specifications   NA
jupyter_events              0.6.3
jupyter_server              2.6.0
jupyterlab_server           2.25.2
llvmlite                    0.41.1
markupsafe                  2.1.2
mpmath                      1.3.0
msgpack                     1.0.4
natsort                     8.2.0
nbformat                    5.8.0
numba                       0.58.1
numcodecs                   0.11.0
numexpr                     2.8.4
numpy                       1.26.2
overrides                   NA
packaging                   23.0
pandas                      1.5.3
parso                       0.8.3
pexpect                     4.8.0
pickleshare                 0.7.5
pkg_resources               NA
platformdirs                2.6.2
prometheus_client           NA
prompt_toolkit              3.0.41
psutil                      5.9.4
ptyprocess                  0.7.0
pure_eval                   0.2.2
pyarrow                     11.0.0
pydev_ipython               NA
pydevconsole                NA
pydevd                      2.9.5
pydevd_file_utils           NA
pydevd_plugins              NA
pydevd_tracing              NA
pydot                       1.4.2
pygments                    2.14.0
pyparsing                   3.0.9
pythonjsonlogger            NA
pytz                        2022.7.1
referencing                 NA
requests                    2.31.0
rfc3339_validator           0.1.4
rfc3986_validator           0.1.1
rpds                        NA
scipy                       1.10.0
send2trash                  NA
simplejson                  3.18.3
six                         1.16.0
sniffio                     1.3.0
socks                       1.7.1
sparse                      0.14.0
sphinxcontrib               NA
stack_data                  0.6.2
sympy                       1.11.1
tlz                         0.12.0
toolz                       0.12.0
torch                       2.1.1
torchgen                    NA
tornado                     6.2
tqdm                        4.64.1
traitlets                   5.9.0
typing_extensions           NA
unicodedata2                NA
uri_template                NA
urllib3                     1.26.14
wcwidth                     0.2.6
webcolors                   1.13
websocket                   1.5.3
yaml                        6.0.1
zarr                        2.13.6
zipp                        NA
zmq                         25.0.2
zoneinfo                    NA
-----
IPython             8.18.1
jupyter_client      8.2.0
jupyter_core        5.3.0
jupyterlab          4.0.9
notebook            6.5.4
-----
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29) [GCC 10.4.0]
Linux-3.10.0-1160.95.1.el7.x86_64-x86_64-with-glibc2.17
-----
Session information updated at 2023-12-11 20:14
@ivirshup
Member

Thanks for opening the issue.

In the first case it looks like pandas isn't actually inferring the dtype, so this is somewhat expected. In contrast, this works:

import anndata
import numpy as np
import pandas as pd

print(anndata.__version__)
# '0.10.3'

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame({
        "test_bool_null": pd.array([True, False, None, False]),
    }),
)

adata.write_h5ad("test.h5ad")

anndata.read_h5ad("test.h5ad").obs
   test_bool_null
0            True
1           False
2            <NA>
3           False

I believe this doesn't get inferred with the code you wrote since pandas currently marks its nullable boolean type as experimental (docs).

We could probably think about trying to infer this at write time.
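
As a rough illustration of what write-time inference could look like, the sketch below casts object columns whose non-missing values are all bools to pandas' nullable "boolean" dtype, which is the dtype that wrote successfully in the example above. `coerce_nullable_bools` is a hypothetical helper name, not part of anndata, and this is not a statement of how anndata does or will implement it.

```python
import numpy as np
import pandas as pd


def coerce_nullable_bools(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where object columns holding only bools and
    missing values are cast to the nullable "boolean" dtype.

    Hypothetical helper for illustration; not part of anndata.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            continue
        present = col.dropna()
        # Require at least one value so all-missing columns are left alone.
        if len(present) > 0 and all(
            isinstance(v, (bool, np.bool_)) for v in present
        ):
            out[name] = col.astype("boolean")
    return out


obs = pd.DataFrame({"test_bool_null": [True, False, None, False]})
fixed = coerce_nullable_bools(obs)
# Show the dtype before and after the cast.
print(obs["test_bool_null"].dtype, "->", fixed["test_bool_null"].dtype)
```

Applying something like this to `obs` before calling `write_h5ad` would be a user-side workaround; columns that mix bools with other objects are deliberately left untouched.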


Similar issue if the file already contains a boolean array with null values (written through the R interface)

Which R interface?

Since I can read a nullable boolean array written by this library, I can't address this without more information. At the least, info about which library wrote it and ideally a demonstration file. That is a curious error though.

@jkanche
Author

jkanche commented Dec 12, 2023

Thanks @ivirshup, here is a minimal example that reproduces this issue using anndataR.

I also just installed the most recent version of anndataR from GitHub, so the version is 0.99.0 for this package.

Create an AnnData object in R and save it as an h5ad file:

library(anndataR)

ad <- AnnData(
  X = matrix(1:15, 3L, 5L),
  obs = data.frame(cell = 1:3, bool_null = c(NA, NA, TRUE)),
  var = data.frame(gene = 1:5),
  obs_names = LETTERS[1:3],
  var_names = letters[1:5]
)

write_h5ad(ad, path = "test_bool_R.h5ad")

Trying to read this in Python:

anndata.read_h5ad("test_bool_R.h5ad")
Traceback:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 anndata.read_h5ad("test_bool_R.h5ad")

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:254, in read_h5ad(filename, backed, as_sparse, as_sparse_fmt, chunk_size)
    251         return read_dataframe(elem)
    252     return func(elem)
--> 254 adata = read_dispatched(f, callback=callback)
    256 # Backwards compat (should figure out which version)
    257 if "raw.X" in f:

File ~/.local/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py:46, in read_dispatched(elem, callback)
     42 from anndata._io.specs import _REGISTRY, Reader
     44 reader = Reader(_REGISTRY, callback=callback)
---> 46 return reader.read_elem(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:249, in Reader.read_elem(self, elem, modifiers)
    247 read_func = partial(read_func, _reader=self)
    248 if self.callback is not None:
--> 249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
    251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:235, in read_h5ad.<locals>.callback(func, elem_name, elem, iospec)
    232 def callback(func, elem_name: str, elem, iospec):
    233     if iospec.encoding_type == "anndata" or elem_name.endswith("/"):
    234         return AnnData(
--> 235             **{
    236                 # This is covering up backwards compat in the anndata initializer
    237                 # In most cases we should be able to call `func(elen[k])` instead
    238                 k: read_dispatched(elem[k], callback)
    239                 for k in elem.keys()
    240                 if not k.startswith("raw.")
    241             }
    242         )
    243     elif elem_name.startswith("/raw."):
    244         return None

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:238, in <dictcomp>(.0)
    232 def callback(func, elem_name: str, elem, iospec):
    233     if iospec.encoding_type == "anndata" or elem_name.endswith("/"):
    234         return AnnData(
    235             **{
    236                 # This is covering up backwards compat in the anndata initializer
    237                 # In most cases we should be able to call `func(elen[k])` instead
--> 238                 k: read_dispatched(elem[k], callback)
    239                 for k in elem.keys()
    240                 if not k.startswith("raw.")
    241             }
    242         )
    243     elif elem_name.startswith("/raw."):
    244         return None

File ~/.local/lib/python3.9/site-packages/anndata/experimental/_dispatch_io.py:46, in read_dispatched(elem, callback)
     42 from anndata._io.specs import _REGISTRY, Reader
     44 reader = Reader(_REGISTRY, callback=callback)
---> 46 return reader.read_elem(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:249, in Reader.read_elem(self, elem, modifiers)
    247 read_func = partial(read_func, _reader=self)
    248 if self.callback is not None:
--> 249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
    251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:251, in read_h5ad.<locals>.callback(func, elem_name, elem, iospec)
    248     return _read_raw(f, as_sparse, rdasp)
    249 elif elem_name in {"/obs", "/var"}:
    250     # Backwards compat
--> 251     return read_dataframe(elem)
    252 return func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/h5ad.py:313, in read_dataframe(group)
    311     return read_dataframe_legacy(group)
    312 else:
--> 313     return read_elem(group)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:341, in read_elem(elem)
    329 def read_elem(elem: StorageType) -> Any:
    330     """
    331     Read an element from a store.
    332 
   (...)
    339         The stored element.
    340     """
--> 341     return Reader(_REGISTRY).read_elem(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
    249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
--> 251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in read_dataframe(elem, _reader)
    691 columns = list(_read_attr(elem.attrs, "column-order"))
    692 idx_key = _read_attr(elem.attrs, "_index")
    693 df = pd.DataFrame(
--> 694     {k: _reader.read_elem(elem[k]) for k in columns},
    695     index=_reader.read_elem(elem[idx_key]),
    696     columns=columns if len(columns) else None,
    697 )
    698 if idx_key != "_index":
    699     df.index.name = idx_key

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:694, in <dictcomp>(.0)
    691 columns = list(_read_attr(elem.attrs, "column-order"))
    692 idx_key = _read_attr(elem.attrs, "_index")
    693 df = pd.DataFrame(
--> 694     {k: _reader.read_elem(elem[k]) for k in columns},
    695     index=_reader.read_elem(elem[idx_key]),
    696     columns=columns if len(columns) else None,
    697 )
    698 if idx_key != "_index":
    699     df.index.name = idx_key

File ~/.local/lib/python3.9/site-packages/anndata/_io/utils.py:205, in report_read_key_on_error.<locals>.func_wrapper(*args, **kwargs)
    203         break
    204 try:
--> 205     return func(*args, **kwargs)
    206 except Exception as e:
    207     add_key_note(e, elem, elem.name, "read")

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/registry.py:251, in Reader.read_elem(self, elem, modifiers)
    249     return self.callback(read_func, elem.name, elem, iospec=get_spec(elem))
    250 else:
--> 251     return read_func(elem)

File ~/.local/lib/python3.9/site-packages/anndata/_io/specs/methods.py:852, in read_nullable_boolean(elem, _reader)
    848 @_REGISTRY.register_read(H5Group, IOSpec("nullable-boolean", "0.1.0"))
    849 @_REGISTRY.register_read(ZarrGroup, IOSpec("nullable-boolean", "0.1.0"))
    850 def read_nullable_boolean(elem, _reader):
    851     if "mask" in elem:
--> 852         return pd.arrays.BooleanArray(
    853             _reader.read_elem(elem["values"]), mask=_reader.read_elem(elem["mask"])
    854         )
    855     else:
    856         return pd.array(_reader.read_elem(elem["values"]))

File /apps/user/gpy/envs/dev/GPy39/lib/python3.9/site-packages/pandas/core/arrays/boolean.py:299, in BooleanArray.__init__(self, values, mask, copy)
    295 def __init__(
    296     self, values: np.ndarray, mask: np.ndarray, copy: bool = False
    297 ) -> None:
    298     if not (isinstance(values, np.ndarray) and values.dtype == np.bool_):
--> 299         raise TypeError(
    300             "values should be boolean numpy array. Use "
    301             "the 'pd.array' function instead"
    302         )
    303     self._dtype = BooleanDtype()
    304     super().__init__(values, mask, copy=copy)

TypeError: values should be boolean numpy array. Use the 'pd.array' function instead
Error raised while reading key '/obs/bool_null' of <class 'h5py._hl.group.Group'> to /

@ivirshup
Member

ivirshup commented Dec 12, 2023

I think this may be an issue with anndataR, where it's not writing the array in a way that h5py recognizes.

@rcannood / @lazappi wdyt? Is this feature meant to work already in anndatar?

I believe the issue is that anndata is expecting h5py to recognize an enumerated boolean type, which I thought rhdf5 had implemented.

@jkanche
Author

jkanche commented Dec 12, 2023

I almost wonder if it has to do with _reader.read_elem. The encoding in the file does seem to store these values as boolean vectors.

import h5py
import pandas as pd
import numpy as np

file = h5py.File("./test_bool_R.h5ad")

file["obs/bool_null"]["values"] # <HDF5 dataset "values": shape (3,), type "|i1">
file["obs/bool_null"]["values"][:] # array([0, 0, 1], dtype=int8)
vals = np.array(file["obs/bool_null"]["values"][:], dtype=bool) 
# array([False, False,  True])

file["obs/bool_null"]["mask"] # <HDF5 dataset "mask": shape (3,), type "|i1">
mask=np.array(file["obs/bool_null"]["mask"][:], dtype=bool) 
# array([ True,  True, False])

pd.arrays.BooleanArray(vals, mask=mask)
# <BooleanArray>
# [<NA>, <NA>, True]
# Length: 3, dtype: boolean

@jkanche
Author

jkanche commented Dec 12, 2023

For now, I'm going to cast these vectors to satisfy pandas: jkanche@a16d924
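
A minimal sketch of that kind of cast, assuming the `values` and `mask` datasets come back as `int8` arrays as in the inspection above. The helper name `to_boolean_array` is hypothetical, and this is not necessarily what the linked commit does.

```python
import numpy as np
import pandas as pd


def to_boolean_array(values: np.ndarray, mask: np.ndarray) -> pd.arrays.BooleanArray:
    """Build a nullable boolean array from integer-encoded values and mask.

    pd.arrays.BooleanArray requires both inputs to be NumPy bool arrays,
    so integer arrays are cast first. Hypothetical helper for illustration.
    """
    return pd.arrays.BooleanArray(
        np.asarray(values).astype(bool),
        mask=np.asarray(mask).astype(bool),
    )


# Same contents as the datasets read from test_bool_R.h5ad above.
values = np.array([0, 0, 1], dtype=np.int8)
mask = np.array([1, 1, 0], dtype=np.int8)
arr = to_boolean_array(values, mask)
print(arr)
```

The cast treats any nonzero integer as True, which is fine for 0/1 data but would hide unexpected values.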

@rcannood
Contributor

@jkanche @ivirshup boolean enums are not implemented in rhdf5 yet, see grimbough/rhdf5#136. There are quite a few other issues that we're in the process of resolving.

Since anndataR hasn't been released yet, I can't recommend using it just yet.

If you'd like to use an anndata-like interface in R, you could use anndata from CRAN for now.

@ivirshup
Member

Thanks for the input!

I would say the writing from R is an issue in the anndataR package rather than here. Maybe we could broaden what's allowed in the future (e.g. try to interpret any byte width integers as bool) but the spec does say boolean array right now.
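
To make the "interpret any byte width integers as bool" idea concrete, here is one possible shape for such a check. It is a sketch only: `interpret_as_bool` is a hypothetical name, and rejecting integer values other than 0 and 1 is an assumption about how strict such a reader would want to be.

```python
import numpy as np


def interpret_as_bool(arr: np.ndarray) -> np.ndarray:
    """Return arr as a bool array if it can be read unambiguously as one.

    Accepts bool arrays unchanged and integer arrays containing only 0 and 1.
    Anything else raises, rather than silently treating e.g. 2 as True.
    Hypothetical helper for illustration; not part of anndata.
    """
    arr = np.asarray(arr)
    if arr.dtype == np.bool_:
        return arr
    if np.issubdtype(arr.dtype, np.integer):
        if np.isin(arr, (0, 1)).all():
            return arr.astype(bool)
        raise ValueError("integer array has values other than 0 and 1")
    raise TypeError(f"cannot interpret dtype {arr.dtype} as boolean")


# int8 data like the arrays anndataR wrote in this thread.
print(interpret_as_bool(np.array([0, 0, 1], dtype=np.int8)))
```

Raising on out-of-range values keeps the reader close to the spec's "boolean array" wording while still tolerating writers that lack an HDF5 enum type.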

I think us inferring nullable boolean types is more of a feature request. I will add that to some other planned work about inferring data types better.

@lazappi

lazappi commented Dec 13, 2023

The R {anndata} part might also be related to the {rhdf5} version installed. More enum support was only added in the latest release (there is still the issue with attributes linked above though).

@flying-sheep
Member

Inferring nullable (boolean and other) types is tracked in #1068. Since the other part of this is upstream, I’m closing this.

@flying-sheep closed this as not planned on Dec 15, 2023