Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No support for mixed column type #726

Closed
nspies-celsiustx opened this issue Mar 14, 2022 · 24 comments
Closed

No support for mixed column type #726

nspies-celsiustx opened this issue Mar 14, 2022 · 24 comments

Comments

@nspies-celsiustx
Copy link

nspies-celsiustx commented Mar 14, 2022

I'm reopening #558 as I'm still having this problem with latest anndata and h5py. It seems that anndata needs to be a bit more defensive about dtypes when writing to h5ad using h5py>=3. I know that most non-numeric columns get converted to categoricals in most situations, but I'm hitting these edge cases with some frequency even with anndata objects that should be large enough to produce categoricals. I suppose some sort of the following logic would solve the issue:

  1. perform the categorical conversion
  2. check for non-numeric, non-categorical columns, convert to {strings/categories?} - it looks like from the previous issue, this is what h5py<3 used to do?

We're mostly running anndata<0.8 so maybe this is fixed there, but I'm also hitting this issue when writing the .uns slot, would be great to apply the same clean-up/failsafe logic there.

Example with long traceback
In [1]: import anndata
   ...: import pandas as pd
   ...:
   ...: adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
   ...: adata.write_h5ad("temp.h5ad")
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py:121: ImplicitModificationWarning: Transforming to str index.
  warnings.warn("Transforming to str index.", ImplicitModificationWarning)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    171         _REGISTRY.get_writer(dest_type, (t, elem.dtype.kind), modifiers)(
--> 172             f, k, elem, *args, **kwargs
    173         )

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     23         def wrapper(g, k, *args, **kwargs):
---> 24             result = func(g, k, *args, **kwargs)
     25             g[k].attrs.setdefault("encoding-type", spec.encoding_type)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_vlen_string_array(f, k, elem, dataset_kwargs)
    343     str_dtype = h5py.special_dtype(vlen=str)
--> 344     f.create_dataset(k, data=elem.astype(str_dtype), dtype=str_dtype, **dataset_kwargs)
    345

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
    144     if (data is not None) and (not isinstance(data, Empty)):
--> 145         dset_id.write(h5s.ALL, h5s.ALL, data)
    146

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.write()

h5py/_proxy.pyx in h5py._proxy.dset_rw()

h5py/_conv.pyx in h5py._conv.str2vlen()

h5py/_conv.pyx in h5py._conv.generic_converter()

h5py/_conv.pyx in h5py._conv.conv_str2vlen()

TypeError: Can't implicitly convert non-string objects to strings

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-1-f5142e4c4f28> in <module>
      3
      4 adata = anndata.AnnData(obs=pd.DataFrame({"a":["1",2,3]}))
----> 5 adata.write_h5ad("temp.h5ad")

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1922             compression_opts=compression_opts,
   1923             force_dense=force_dense,
-> 1924             as_dense=as_dense,
   1925         )
   1926

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
     96         elif adata.raw is not None:
     97             write_elem(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
---> 98         write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
     99         write_elem(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    100         write_elem(f, "obsm", dict(adata.obsm), dataset_kwargs=dataset_kwargs)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    212     def func_wrapper(elem, key, val, *args, **kwargs):
    213         try:
--> 214             return func(elem, key, val, *args, **kwargs)
    215         except Exception as e:
    216             if "Above error raised while writing key" in format(e):

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in write_elem(f, k, elem, modifiers, *args, **kwargs)
    173         )
    174     else:
--> 175         _REGISTRY.get_writer(dest_type, t, modifiers)(f, k, elem, *args, **kwargs)
    176
    177

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/registry.py in wrapper(g, k, *args, **kwargs)
     22         @wraps(func)
     23         def wrapper(g, k, *args, **kwargs):
---> 24             result = func(g, k, *args, **kwargs)
     25             g[k].attrs.setdefault("encoding-type", spec.encoding_type)
     26             g[k].attrs.setdefault("encoding-version", spec.encoding_version)

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/specs/methods.py in write_dataframe(f, key, df, dataset_kwargs)
    510     for colname, series in df.items():
    511         # TODO: this should write the "true" representation of the series (i.e. the underlying array or ndarray depending)
--> 512         write_elem(group, colname, series._values, dataset_kwargs=dataset_kwargs)
    513
    514

/opt/conda/envs/test_adata/lib/python3.7/site-packages/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    222                     f"Above error raised while writing key {key!r} of {type(elem)} "
    223                     f"to {parent}"
--> 224                 ) from e
    225
    226     return func_wrapper

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'a' of <class 'h5py._hl.group.Group'> to /

and

In [4]: anndata.__version__, h5py.__version__
Out[4]: ('0.8.0rc1', '3.6.0')
@ivirshup
Copy link
Member

The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.

I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.

@nspies-celsiustx
Copy link
Author

The check we use for converting to categorical is if there are repeated values in a string column. This wouldn't quite trigger in your case.

I think the issue here is you have a mixed type column in obs, which isn't something we generally support IO for. I'm not sure getting back a column of ["1", "2", "3"] (what h5py 2.0 would have done) is the correct behavior here – since that's not what you tried to write.

Yeah, I'm willing to accept that there isn't a single correct way of doing this, but at the end of the day, I'm getting burned pretty frequently by long-running code that fails in the middle because the file can't be written. I'd much prefer some sort of conversion to occur, especially if it's consistent with how the conversion was being performed previously. Some code we're running seems to return dtype: object series that have only numeric values (something like the result of pd.Series([1,2,3], dtype=object)).

Would more aggressively converting to categorical, eg for mixed types, be another option?

@ivirshup
Copy link
Member

I think a "proactively convert all contents to something we can write" method could be reasonable.

Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

But also, do you know why your getting object valued columns?

@nspies-celsiustx
Copy link
Author

I think a "proactively convert all contents to something we can write" method could be reasonable.

Yeah, I think this would be great.

Falling back to pickling could also be useful for your long running tasks. That way we won't break anything by converting it wrong.

Good suggestion!

But also, do you know why your getting object valued columns?

Obviously this is the short-term solution, but these issues keep on cropping up so ideal to implement the convert-all-contents-to-something-we-can-write method. If you give some pointers on general approach ie where to look, I'm happy to put together a PR.

@ivirshup
Copy link
Member

ideal to implement the convert-all-contents-to-something-we-can-write method.

The more I think about this, the less I like it feature. If you write a file, but reading it gives you back something different I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.

An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

This idea has been floating around for a bit.

Comments from some old code:

File ~/anaconda3/envs/scvelo2/lib/python3.8/site-packages/anndata/_io/h5ad.py:148, in write_not_implemented(f, key, value, dataset_kwargs)
    143 @report_write_key_on_error
    144 def write_not_implemented(f, key, value, dataset_kwargs=MappingProxyType({})):
    145     # If it’s not an array, try and make it an array. If that fails, pickle it.
    146     # Maybe rethink that, maybe this should just pickle,
    147     # and have explicit implementations for everything else
--> 148     raise NotImplementedError(
    149         f"Failed to write value for {key}, "
    150         f"since a writer for type {type(value)} has not been implemented yet."
    151     )

But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.

I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.


@Zethson and @Imipenem may have some additional insight here, as they're working with quite heterogenous medical record data over at theislab/ehrapy.

@nspies-celsiustx
Copy link
Author

The more I think about this, the less I like it feature. If you write a file, but reading it gives you back something different I don't think the writing was very useful. For cases where there's something we don't support, I'd think interoperability gives way to just getting the data stored.

An alternative here: Maybe we can skip aggressive coercion, write all we can, and just pickle the elements we can't write?

I'm not entirely against pickling, but it's not a great long-term storage solution as changes to eg pandas could make it impossible to deserialize. I would try as hard as possible to do faithful conversions and as a final fallback convert to pickle but with a warning message noting which slots needed that conversion - and allow reading back the h5ad even if some fields fail to unpickle.

Would more aggressive conversion to categorical work as well?

This idea has been floating around for a bit.
But if the elements breaking your write methods are like: pd.Series([1,2,3], dtype=object), I'm happy to coerce those to integers.

I think this would basically look like walking the tree of the object, and calling something like pd.api.types.infer_dtype on anything that looks like an object. This would be quite similar to pd.DataFrame.infer_objects. For how to walk the object, it would be nice if there was some reuse of IO code, but that might be difficult at the moment.

Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.

@ivirshup
Copy link
Member

ivirshup commented Mar 16, 2022

I'm not entirely against pickling, but it's not a great long-term storage solution

Yeah, I would say that you have to opt-in to the pickling behavior. And the documentation would say "please don't do this for anything other than ephemeral storage, and even then please don't". This could be a "permissive" writing mode.

and allow reading back the h5ad even if some fields fail to unpickle.

I have also been wanting to have a permissive reading mode. E.g. if you don't recognize the IO spec for a single element, you can skip it.


Would more aggressive conversion to categorical work as well?

I think this is the route ehrapy had been taking, but they have run into issues with it.pd.Series(["a", 2, 3]).astype("category") would still fail here.

Irrespective of other solutions, applying pd.DataFrame.infer_objects() sounds like a great idea.

We could have an AnnData.infer_objects method. I'm a bit cautious about using the pandas methods directly since I don't really trust the stability of some of their extension types.

@bio-la
Copy link

bio-la commented Dec 11, 2022

i experience the same behaviour, "TypeError: Can't implicitly convert non-string objects to strings" when I'm attempting to write an anndata which has object, string, float columns all containing some NA values ( i need the NA to signal missing samples for multiple columns )
attempting to convert to string or categorical doesn't solve the issue. No method has been defined for writing <class 'pandas.core.arrays.string_.StringArray'> elements to <class 'h5py._hl.group.Group'>
what's the suggested workaround?
thanks!

scanpy: 1.9.1
anndata: 0.8.0

@avpatel18
Copy link

I get the same error after calculating Marker genes (wilcoxon method) and trying to save anndata object again
Is there any solution to this?

_TypeError: Can't implicitly convert non-string objects to strings Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /_

@nroak
Copy link

nroak commented Feb 28, 2023

I have the same error as @Avaptel18
Has any solution been release so far?

@ivirshup
Copy link
Member

ivirshup commented Mar 1, 2023

The issue reported by @bio-la is tracked in #679 and now is on tracked for inclusion in 0.9.0.

The specific error reported by @avpatel18 could happen for a large number of reasons, some of which are not things we can handle. It may be resolved for some cases which can now be resolved to strings + null values.

@Niubile001
Copy link

I have the same error as @Avaptel18
Has any solution been release so far?

I have the same error as @Avaptel18

@redst4r
Copy link

redst4r commented Aug 24, 2023

Same issue as @Avaptel18 on writing to hdf5:

TypeError: Can't implicitly convert non-string objects to strings

It happens if you calculate marker genes with .rank_genes_groups() and filter them (sc.tl.filter_rank_genes_groups): Some elements in adata.uns['rank_genes_groups'] end up as nan due to filtering and h5py complains.

  • h5py: 3.8.0
  • anndata: 0.8.0
  • scanpy: 1.9.3

Any workaround for this? Looked like there's going to be a fix in anndata 0.9., but that was postponed...

@avpatel18
Copy link

@redst4r exactly, it happens after (sc.tl.filter_rank_genes_groups).

so far I am just deleting the .uns key layer before saving the anndata object.

Could someone please update #726 if there is any workaround or once the update comes.
Thanks!

@flying-sheep flying-sheep changed the title Can't implicitly convert non-string objects to strings No support for mixed column type Aug 28, 2023
@flying-sheep
Copy link
Member

flying-sheep commented Aug 28, 2023

OK, then you two run into #679. @ivirshup we should prioritize that one because a lot of people run into it, before we address bigger fish like more aggressive coercion and so on.

@redst4r
Copy link

redst4r commented Sep 5, 2023

Here's a minimal example of the bug I run into:

import scanpy as sc
import anndata
print("scanpy", sc.__version__)  # 1.9.3
print("anndata", anndata.__version__)  # 0.8.0
adata = sc.datasets.pbmc3k_processed()

sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')
TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /

As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.

@kaarg2
Copy link

kaarg2 commented Sep 5, 2023

Here's a minimal example of the bug I run into:

import scanpy as sc
import anndata
print("scanpy", sc.__version__)  # 1.9.3
print("anndata", anndata.__version__)  # 0.8.0
adata = sc.datasets.pbmc3k_processed()

sc.tl.rank_genes_groups(adata, groupby='louvain')
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1)  # commenting this line will make the whole thing run without error
adata.write_h5ad('/tmp/pbmc_bug.h5ad')
TypeError: Can't implicitly convert non-string objects to strings
Above error raised while writing key 'names' of <class 'h5py._hl.group.Group'> to /

As mentioned previously, filter_rank_genes_groups() sets some elements in adata.uns['rank_genes_groups'] to NaN, causing failure when writing to hdf5.

Deleting 'rank_genes_groups' and 'rank_genes_groups_filtered' resolved the issue for me. Thank you!

del adata.uns['rank_genes_groups_filtered'] del adata.uns['rank_genes_groups']

@redst4r
Copy link

redst4r commented Sep 6, 2023

@kaarg2: That "solves" the problem, but of course the DEG won't be saved to disk then, which is very inconvenient.

@flying-sheep: As you mentioned, any anndata produced by scanpy and its functions should really be writable to disk, everything else is a bug. We should really fix that!

I'd take a stab at the problem myself, but the format that sc.tl.rank_genes_groups() saves DEGs is a bit esoteric (numpy.recarrays) and I dont want to mess with it too much. Not sure if there's a reason for the choice of recarray over just pandas.DataFrames (one per DE-group) (where filtering out rows would be a bit easier), but that's another issue.

In fact the recarray is a bit of a problem here, it forces each DE-group to have the same amount of genes present (as it's shape is genes x groups). The filtering would cause different amount of DEGs per group which cant be stored in the recarray now. I guess that's why it gets filled up with NaN entries instead of just deleting those genes.

Looks like the best options at this point would be:

  1. change the storage format of DEG from (the rather unnatural) np.recarray to pd.DataFrames. Lots of downstream adjustments needed, breaking changes... Something for scanpy 2.0 ...
  2. or add a column is_filtered to the adata.uns['rank_genes_groups_filtered'], marking the filtered genes (rather than setting things to NaN). We just have to make sure downstream functions respect that adata.uns['rank_genes_groups_filtered']['is_filtered'] flag.

@flying-sheep
Copy link
Member

print("anndata", anndata.__version__)  # 0.8.0

that’s not the newest anndata version.

nans in string columns are supported. I can’t reproduce your example above with the main version.

Since we’re going to release 0.10 in a little bit, I won’t bother figuring out when this was fixed. It’s either fixed or will be later today.

@vkartha
Copy link

vkartha commented Dec 12, 2023

I'm still dealing with this issue, at the very least is it possible to flag what the problematic obs column is? Running obs.dtypes() shows I have a mix of int/float32/float64/category fields in the pandas object (nothing seems special here), but it is really hard to know which column is the one causing the error since it's such a generic message.

I know it's obs related (and not uns) cause I tried to delete the entire uns layer and it still throw this error, plus the only lead I have in the error points to an obs key error:

98 write_elem(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
--> 617 write_elem(g, "categories", v.categories._values, dataset_kwargs=dataset_kwargs)

TypeError: Can't implicitly convert non-string objects to strings

Above error raised while writing key 'categories' of <class 'h5py._hl.group.Group'> to /

@kaarg2
Copy link

kaarg2 commented Dec 19, 2023

There might be some entries in adata.obs or the title of some obs column contain illegal character such as "/". You can try changing those characters in your adata.obs.

Alternatively, you can use pickle. An example can be found here.

@flying-sheep
Copy link
Member

@vkartha you’re right, the way the written key is reported is not super helpful right now.

I filed #1272

@Rafael-Silva-Oliveira
Copy link

Also getting the same issue. Suggestions from @redst4r make sense and should be implemented

@flying-sheep
Copy link
Member

flying-sheep commented Jan 2, 2024

OK, this issue was closed since there are multiple specific sub-issues, and any further conversation here is just confusing.

To avoid further “me too”s, I’m going to link to the related issues a final time and then lock this thread.

Feel free to open new issues if you think something is not covered by the following:

@scverse scverse locked as resolved and limited conversation to collaborators Jan 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests