Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Semi-)automatic conversion of nullable columns to the appropriate pandas arrays #1068

Open
grst opened this issue Jul 23, 2023 · 9 comments
Milestone

Comments

@grst
Copy link
Contributor

grst commented Jul 23, 2023

Please describe your wishes and possible alternatives to achieve the desired result.

Since #504, AnnData supports nullable int and bool columns in obs. Support for strings is planned in #679.

However, this only works if the nullable columns are represented as the appropriate pandas Array extension type.

For instance this

import anndata
import numpy as np
import pandas as pd

adata = anndata.AnnData(
    X=None,
    obs=pd.DataFrame().assign(
        test_int=np.array([1, 2, None, 3]),
        test_bool=[True, False, None, False],
    ),
)
adata.write_h5ad("test.h5ad")

fails with TypeError: Can't implicitly convert non-string objects to strings.

After converting the columns to pandas arrays, the object can be saved:

for c in adata.obs.columns:
    adata.obs[c] = pd.array(adata.obs[c].values)
adata.write_h5ad("test.h5ad")

Unfortunately, the pandas extension arrays are little known and Nones might end up in adata.obs for various reasons (for instance scverse/scirpy#434).

I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?
Or maybe there should be an equivalent to AnnData.strings_to_categoricals that can be called to sanitize such columns?

@flying-sheep
Copy link
Member

Totally.

About the how, well: We never supported “object” columns, there just hasn’t been a string column type when AnnData switched to DataFrames. Initially AnnData used structured arrays which, while clumsy, had the advantage that you couldn’t put data in that wasn’t writable later.

Ideally, AnnData would convert things on assignment so errors surface early. We didn’t do that so far, since subclassing DataFrames is iffy, but it might be the best solution.

There could be a way to deactivate this for people who don’t care about writeability.

@ivirshup
Copy link
Member

I was wondering if such columns should be automatically converted to the appropriate pandas array, e.g. on save?

I think this is doable, but would like to rely on a consistent set of rules defined upstream of us. At the moment, pandas has:

  • Series.infer_dtype which would convert your example to floats
  • DataFrame.convert_dtypes which converts everything to pandas dtypes, regardless of whether it's needed. This includes conversion to dtypes that we don't support like nullable floating point.

We didn’t do that so far, since subclassing DataFrames is iffy, but it might be the best solution.

I don't think subclassing a pandas dataframes is something we can support.

@flying-sheep
Copy link
Member

Probably not, but it would be optimal for users. There would be much fewer issues when people run into errors when trying to add a column. By contrast, running into an issue when writing is much less user friendly, as it means people have to solve all possible issues in bulk instead of running into an early exception.

@ivirshup
Copy link
Member

ivirshup commented Oct 4, 2023

We can probably do a more aggressive inference on object types as a stop gap, but the future of string extension arrays is a little unclear.

I've been told they want to just use arrow directly, but then I heard a lot of pushback on pandas having a direct dependency on pyarrow. So which of the string types should we be defaulting to? The one that adds a big dependency, or the one that is just a thin wrapper around object that sounds like it'll be deprecated?

@flying-sheep
Copy link
Member

Do you have any links for that? Would be nice to see if there’s been an update on the plans

@ivirshup
Copy link
Member

ivirshup commented Oct 6, 2023

My understanding comes from discussion at euroscipy, not sure if there is a consolidated place for discussion.

@flying-sheep
Copy link
Member

flying-sheep commented Oct 9, 2023

Would be great to get some at least semi-official writeup before making decision based on something that might never materialize. Maybe we can get hold of someone involved?

@ivirshup
Copy link
Member

ivirshup commented Oct 9, 2023

@flying-sheep
Copy link
Member

News from the pandas issue:

dtype=string will be arrow backed starting from 3.0 or when you activate the infer_string option

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants