No support for mixed column type #726

nspies-celsiustx · 2022-03-14T17:28:49Z

I'm reopening #558 as I'm still having this problem with latest anndata and h5py. It seems that anndata needs to be a bit more defensive about dtypes when writing to h5ad using h5py>=3. I know that most non-numeric columns get converted to categoricals in most situations, but I'm hitting these edge cases with some frequency even with anndata objects that should be large enough to produce categoricals. I suppose some sort of the following logic would solve the issue:

perform the categorical conversion
check for non-numeric, non-categorical columns, convert to {strings/categories?} - it looks like from the previous issue, this is what h5py<3 used to do?

We're mostly running anndata<0.8 so maybe this is fixed there, but I'm also hitting this issue when writing the .uns slot, would be great to apply the same clean-up/failsafe logic there.

Example with long traceback

In [1]: import anndata
   ...: import pandas as pd
   ...:
   ...: adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
   ...: adata.write_h5ad("temp.h5ad")
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py:121: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    171         _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)(
--> 172             f, k, elem, *args, **kwargs
    173         )

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     23         def wrapper(g, k, *args, **kwargs):
---> 24             result = func(g, k, *args, **kwargs)
     25             g[k].attrs.setdefault("encoding-type", spec.encoding_type)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_vlen_string_array(f, k, elem, dataset_kwargs)
    343     str_dtype = h5py.special_dtype(vlen=str)
--> 344     f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs)
    345

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
    144     if (data is not None) and (not isinstance(data, Empty)):
--> 145         dset_id.write(h5s.ALL, h5s.ALL, data)
    146

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.write()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_conv.pyx in h5py._conv.str2vlen()

h5py/_conv.pyx in h5py._conv.generic_converter()

h5py/_conv.pyx in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-f5142e4c4f28> in <module>
      3
      4 adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
----> 5 adata.write_h5ad("temp.h5ad")

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1922             compression_opts=compression_opts,
   1923             force_dense=force_dense,
-> 1924             as_dense=as_dense,
   1925         )
   1926

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
     96         elif adata.raw is not None:
     97             write_elem(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
---> 98         write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
     99         write_elem(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    100         write_elem(f, "obsm", dict(adata.obsm), dataset_kwargs=dataset_kwargs)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    212     def func_wrapper(elem, key, val, *args, **kwargs):
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:
    216             if "Above error raised while writing key" in format(e):

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    173         )
    174     else:
--> 175         _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
    176
    177

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     22         @wraps(func)
     23         def wrapper(g, k, *args, **kwargs):
---> 24             result = func(g, k, *args, **kwargs)
     25             g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     26             g[k].attrs.setdefault("encoding-version", spec.encoding_version)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_dataframe(f, key, df, dataset_kwargs)
    510     for colname, series in df.items():
    511         # TODO: this should write the "true" representation of the series (i.e. the underlying array or ndarray depending)
--> 512         write_elem(group, colname, series._values, dataset_kwargs=dataset_kwargs)
    513
    514

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    222                     f"Above error raised while writing key {key!r} of {type(elem)} "
    223                     f"to {parent}"
--> 224                 ) from e
    225
    226     return func_wrapper

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'a' of <class 'h5py._hl.group.Group'> to /

and

In [4]: anndata.__version__, h5py.__version__
Out[4]: ('0.8.0rc1', '3.6.0')


ivirshup · 2022-03-14T17:35:07Z

The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.

I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.

nspies-celsiustx · 2022-03-14T17:42:48Z

> The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.
>
> I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.

Yeah, I'm willing to accept that there isn't a single correct way of doing this, but at the end of the day, I'm getting burned pretty frequently by long-running code that fails in the middle because the file can't be written. I'd much prefer some sort of conversion to occur, especially if it's consistent with how the conversion was being performed previously. Some code we're running seems to return dtype: object series that have only numeric values (something like the result of pd.Series([1,2,3], dtype=object)).

Would more aggressively converting to categorical, eg for mixed types, be another option?

ivirshup · 2022-03-14T17:49:45Z

I think a "proactively convert all contents to something we can write" method could be reasonable.

Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

But also, do you know why you're getting object valued columns?

nspies-celsiustx · 2022-03-14T18:59:59Z

> I think a "proactively convert all contents to something we can write" method could be reasonable.

Yeah, I think this would be great.

> Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

Good suggestion!

> But also, do you know why you're getting object valued columns?

Obviously this is the short-term solution, but these issues keep on cropping up so ideal to implement the convert-all-contents-to-something-we-can-write method. If you give some pointers on general approach ie where to look, I'm happy to put together a PR.

ivirshup · 2022-03-15T21:38:09Z

> ideal to implement the convert-all-contents-to-something-we-can-write method.

The more I think about this, the less I like this feature. If you write a file, but reading it gives you back something different, I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.

An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

This idea has been floating around for a bit.

Comments from some old code:

File ~/anaconda3/envs/scvelo2/lib/python3.8/site-packages/anndata/_io/h5ad.py:148, in write_not_implemented(f, key, value, dataset_kwargs)
    143 @report_write_key_on_error
    144 def write_not_implemented(f, key, value, dataset_kwargs=MappingProxyType({})):
    145     # If it’s not an array, try and make it an array. If that fails, pickle it.
    146     # Maybe rethink that, maybe this should just pickle,
    147     # and have explicit implementations for everything else
--> 148     raise NotImplementedError(
    149         f"Failed to write value for {key}, "
    150         f"since a writer for type {type(value)} has not been implemented yet."
    151     )

But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.

I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.
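As a rough illustration of the "call infer_dtype on anything that looks like an object" idea, here is a minimal sketch for a single DataFrame such as obs. The helper name report_object_columns is hypothetical and not part of anndata; it only reports what pandas infers and does not walk the rest of the AnnData object.

```python
import pandas as pd


def report_object_columns(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to the content type pandas infers for it.

    Hypothetical helper for illustration; not an anndata API.
    """
    return {
        col: pd.api.types.infer_dtype(series, skipna=True)
        for col, series in df.items()
        if series.dtype == object
    }


obs = pd.DataFrame(
    {
        "a": ["1", 2, 3],                         # genuinely mixed str/int
        "b": pd.Series([1, 2, 3], dtype=object),  # integers stored as object
        "d": [0.1, 0.2, 0.3],                     # plain float column, skipped
    }
)
report = report_object_columns(obs)
print(report)
```

Columns reported as "integer" or "floating" are candidates for lossless coercion, while anything reported with a "mixed" prefix has no single faithful dtype.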

@Zethson and @Imipenem may have some additional insight here, as they're working with quite heterogeneous medical record data over at theislab/ehrapy.

nspies-celsiustx · 2022-03-16T13:26:33Z

> The more I think about this, the less I like this feature. If you write a file, but reading it gives you back something different, I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.
>
> An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

I'm not entirely against pickling, but it's not a great long-term storage solution as changes to eg pandas could make it impossible to deserialize. I would try as hard as possible to do faithful conversions and as a final fallback convert to pickle but with a warning message noting which slots needed that conversion - and allow reading back the h5ad even if some fields fail to unpickle.

Would more aggressive conversion to categorical work as well?

> This idea has been floating around for a bit.
> But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.
>
> I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.

Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.
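For reference, a small sketch of what infer_objects does and does not fix, using only pandas (no anndata involved): object columns that hold a single numeric type are converted, while a truly mixed column is left untouched.

```python
import pandas as pd

# An object-dtype Series that really only holds integers.
numeric_as_object = pd.Series([1, 2, 3], dtype=object)
converted = numeric_as_object.infer_objects()

# A genuinely mixed Series: no single dtype fits, so it stays object.
mixed = pd.Series(["1", 2, 3]).infer_objects()

print(converted.dtype, mixed.dtype)
```

So this would cover the pd.Series([1,2,3], dtype=object) case, but the mixed column from the original report would still need separate handling.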

ivirshup · 2022-03-16T14:08:53Z

> I'm not entirely against pickling, but it's not a great long-term storage solution

Yeah, I would say that you have to opt-in to the pickling behavior. And the documentation would say "please don't do this for anything other than ephemeral storage, and even then please don't". This could be a "permissive" writing mode.

> and allow reading back the h5ad even if some fields fail to unpickle.

I have also been wanting to have a permissive reading mode. E.g. if you don't recognize the IO spec for a single element, you can skip it.

> Would more aggressive conversion to categorical work as well?

I think this is the route ehrapy had been taking, but they have run into issues with it. pd.Series(["a", 2, 3]).astype("category") would still fail here.
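A quick sketch of why the categorical route does not help, using only pandas: the conversion itself succeeds, but the categories keep the original Python objects, so the array that has to be written is still mixed.

```python
import pandas as pd

cat = pd.Series(["a", 2, 3]).astype("category")

# The categories index still holds one str and two ints.
categories = cat.cat.categories
inferred = pd.api.types.infer_dtype(categories)

print(categories.dtype, inferred)
```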

> Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.

We could have an AnnData.infer_objects method. I'm a bit cautious about using the pandas methods directly since I don't really trust the stability of some of their extension types.

bio-la · 2022-12-11T09:29:20Z

I experience the same behaviour, "TypeError: Can't implicitly convert non-string objects to strings", when attempting to write an anndata object that has object, string, and float columns, all containing some NA values (I need the NA to signal missing samples for multiple columns).
Attempting to convert to string or categorical doesn't solve the issue; I then get: No method has been defined for writing <class 'pandas.core.arrays.string_.StringArray'> elements to <class 'h5py._hl.group.Group'>
What's the suggested workaround?
Thanks!

scanpy: 1.9.1
anndata: 0.8.0

avpatel18 · 2022-12-22T14:18:50Z

I get the same error after calculating Marker genes (wilcoxon method) and trying to save anndata object again
Is there any solution to this?

TypeError: Can't implicitly convert non-string objects to strings Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /

nroak · 2023-02-28T18:53:09Z

I have the same error as @avpatel18
Has any solution been released so far?

ivirshup · 2023-03-01T13:51:50Z

The issue reported by @bio-la is tracked in #679 and is now on track for inclusion in 0.9.0.

The specific error reported by @avpatel18 could happen for a large number of reasons, some of which are not things we can handle. It may be resolved for some cases which can now be resolved to strings + null values.

Niubile001 · 2023-07-19T02:39:35Z

> I have the same error as @avpatel18
> Has any solution been released so far?

I have the same error as @avpatel18

redst4r · 2023-08-24T12:32:10Z

Same issue as @avpatel18 on writing to hdf5:

TypeError: Can't implicitly convert non-string objects to strings

It happens if you calculate marker genes with .rank_genes_groups() and filter them (sc.tl.filter_rank_genes_groups): Some elements in adata.uns['rank_genes_groups'] end up as nan due to filtering and h5py complains.

h5py: 3.8.0
anndata: 0.8.0
scanpy: 1.9.3

Any workaround for this? It looked like there was going to be a fix in anndata 0.9, but that was postponed...

avpatel18 · 2023-08-24T14:19:02Z

@redst4r exactly, it happens after (sc.tl.filter_rank_genes_groups).

so far I am just deleting the .uns key layer before saving the anndata object.

Could someone please update #726 if there is any workaround or once the update comes.
Thanks!

flying-sheep · 2023-08-28T09:12:41Z

OK, then you two run into #679. @ivirshup we should prioritize that one because a lot of people run into it, before we address bigger fish like more aggressive coercion and so on.

redst4r · 2023-09-05T18:03:01Z

Here's a minimal example of the bug I run into:

import scanpy as sc
import anndata
print("scanpy", sc.__version__)  # 1.9.3
print("anndata", anndata.__version__)  # 0.8.0
adata = sc.datasets.pbmc3k_processed()

sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')

TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /

As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.
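Until this is fixed upstream, one possible stopgap is to turn the NaN entries into strings before writing. The sketch below uses a small stand-in array rather than real scanpy output, and it assumes the filtered 'names' entry is a structured array with object-dtype fields; the helper stringify_object_fields is hypothetical. Note that this is lossy: NaN becomes the literal string "nan" and has to be interpreted again after reading.

```python
import numpy as np


def stringify_object_fields(rec: np.ndarray) -> np.ndarray:
    """Return a copy of a structured array with every object field cast to str.

    Hypothetical workaround helper; NaN entries become the string "nan".
    """
    out = rec.copy()
    for name in out.dtype.names:
        if out.dtype[name] == object:
            out[name] = np.array([str(v) for v in out[name]], dtype=object)
    return out


# Stand-in for adata.uns['rank_genes_groups_filtered']['names'] (assumed layout).
names = np.rec.fromarrays(
    [
        np.array(["CD3D", np.nan], dtype=object),
        np.array(["NKG7", "GNLY"], dtype=object),
    ],
    names=["0", "1"],
)
fixed = stringify_object_fields(names)
print(fixed)
```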

kaarg2 · 2023-09-05T23:34:15Z

> Here's a minimal example of the bug I run into:
>
> import scanpy as sc
> import anndata
> print("scanpy", sc.__version__)  # 1.9.3
> print("anndata", anndata.__version__)  # 0.8.0
> adata = sc.datasets.pbmc3k_processed()
>
> sc.tl.rank_genes_groups(adata, groupby='louvain')
> sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
> adata.write_h5ad('/tmp/pbmc_bug.h5ad')
>
> TypeError: Can't implicitly convert non-string objects to strings
> Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /
>
> As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.

Deleting 'rank_genes_groups' and 'rank_genes_groups_filtered' resolved the issue for me. Thank you!

del adata.uns['rank_genes_groups_filtered']
del adata.uns['rank_genes_groups']

redst4r · 2023-09-06T10:17:56Z

@kaarg2: That "solves" the problem, but of course the DEG won't be saved to disk then, which is very inconvenient.

@flying-sheep: As you mentioned, any anndata produced by scanpy and its functions should really be writable to disk, everything else is a bug. We should really fix that!

I'd take a stab at the problem myself, but the format that sc.tl.rank_genes_groups() saves DEGs in is a bit esoteric (numpy.recarrays) and I don't want to mess with it too much. Not sure if there's a reason for the choice of recarray over just pandas.DataFrames (one per DE-group) (where filtering out rows would be a bit easier), but that's another issue.

In fact the recarray is a bit of a problem here: it forces each DE-group to have the same number of genes present (as its shape is genes x groups). The filtering would cause a different number of DEGs per group, which can't be stored in the recarray now. I guess that's why it gets filled up with NaN entries instead of just deleting those genes.

Looks like the best options at this point would be:

change the storage format of DEG from (the rather unnatural) np.recarray to pd.DataFrames. Lots of downstream adjustments needed, breaking changes... Something for scanpy 2.0 ...
or add a column is_filtered to the adata.uns['rank_genes_groups_filtered'], marking the filtered genes (rather than setting things to NaN). We just have to make sure downstream functions respect that adata.uns['rank_genes_groups_filtered']['is_filtered'] flag.

flying-sheep · 2023-09-08T09:45:49Z

> print("anndata", anndata.__version__)  # 0.8.0

that’s not the newest anndata version.

nans in string columns are supported. I can’t reproduce your example above with the main version.

Since we’re going to release 0.10 in a little bit, I won’t bother figuring out when this was fixed. It’s either fixed or will be later today.

vkartha · 2023-12-12T19:49:17Z

I'm still dealing with this issue. At the very least, is it possible to flag what the problematic obs column is? Running obs.dtypes shows I have a mix of int/float32/float64/category fields in the pandas object (nothing seems special here), but it is really hard to know which column is the one causing the error since it's such a generic message.

I know it's obs related (and not uns) because I tried to delete the entire uns layer and it still throws this error, plus the only lead I have in the error points to an obs key error:

98 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
--> 617 write_elem(g, "categories", v.categories._values, dataset_kwargs=dataset_kwargs)

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /
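Since the traceback points at the 'categories' of some categorical column, one way to narrow it down by hand is to check which columns have mixed-type contents or categories. This is only a diagnostic sketch with a hypothetical helper name (suspicious_columns), not something anndata provides, and it only catches the mixed-type case.

```python
import pandas as pd


def suspicious_columns(df: pd.DataFrame) -> dict:
    """Return columns whose values (or categories) pandas infers as mixed."""
    problems = {}
    for col, series in df.items():
        if isinstance(series.dtype, pd.CategoricalDtype):
            values = series.cat.categories  # what gets written as 'categories'
        elif series.dtype == object:
            values = series
        else:
            continue  # plain numeric/bool columns are not the culprit here
        kind = pd.api.types.infer_dtype(values, skipna=True)
        if kind.startswith("mixed"):
            problems[col] = kind
    return problems


obs = pd.DataFrame(
    {
        "ok": pd.Series(["x", "y", "x"]).astype("category"),
        "bad": pd.Series(["a", 2, 3]).astype("category"),
        "num": [1, 2, 3],
    }
)
found = suspicious_columns(obs)
print(found)
```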

kaarg2 · 2023-12-19T22:50:14Z

Some entries in adata.obs, or the names of some obs columns, might contain illegal characters such as "/". You can try replacing those characters in your adata.obs.

Alternatively, you can use pickle. An example can be found here.

flying-sheep · 2023-12-20T10:23:51Z

@vkartha you’re right, the way the written key is reported is not super helpful right now.

I filed #1272

Rafael-Silva-Oliveira · 2024-01-02T13:56:55Z

Also getting the same issue. Suggestions from @redst4r make sense and should be implemented

flying-sheep · 2024-01-02T14:45:46Z

OK, this issue was closed since there are multiple specific sub-issues, and any further conversation here is just confusing.

To avoid further “me too”s, I’m going to link to the related issues a final time and then lock this thread.

Feel free to open new issues if you think something is not covered by the following:

Removing the recarray is scanpy#61 (Rethink group IDs in rank_genes_groups); it’s on our list for scanpy 2.0
Nullable string columns are tracked in #679 (Nullable string columns) and #1068 ((Semi-)automatic conversion of nullable columns to the appropriate pandas arrays), and we can finally make progress with it since 2 weeks ago or so
Better write errors are tracked in #1272 (Write error reporting got worse) and will go into the next patch release today or tomorrow

So you’ll see at least the following starting from tomorrow:
```
Error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /uns/rank_genes_groups_filtered
```

ivirshup added needs info topic: io labels Mar 14, 2022

ivirshup mentioned this issue Apr 21, 2022

Writing a concatenated object fails when an obs column has mixed dtype #761

Closed

jeskowagner mentioned this issue Mar 29, 2023

Can't convert string categoricals #963

Closed

flying-sheep added Needs info❔ and removed needs info labels Jun 12, 2023

flying-sheep changed the title ~~Can't implicitly convert non-string objects to strings~~ No support for mixed column type Aug 28, 2023

flying-sheep closed this as completed Sep 8, 2023

Bento007 mentioned this issue Sep 21, 2023

cellxgene-schema must validate that -add-labels will not result in a write failure during ingestion chanzuckerberg/single-cell-curation#405

Closed

scverse locked as resolved and limited conversation to collaborators Jan 2, 2024
