Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PDEP-13: The pandas Logical Type System #58455

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Apr 27, 2024

@pandas-dev/pandas-core


## Proposal

Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas supports at least all of the following _logical_ types, excluding any type widths for brevity:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?

Copy link
Member Author

@WillAyd WillAyd Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.

In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of pd.ArrowDtype(pa.date32()), pd.ArrowDtype(pa.timestamp(<unit>)) and datetime64[unit] since we don't offer any logical abstraction.

In the model of this PDEP, we would offer something to the effect pd.date() and pd.datetime(), abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315

This PDEP proposes the following backends should be prioritized in the following order (1. is the highest priority):

1. Arrow
2. pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe call this "masked"? calling it "pandas" seems to preference it as being canonical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does pandas extension make more sense? Masked is maybe too specific, since this also refers to things like dictionary-encoded / categorical

...

@property
def physical_type:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Other areas where I think it would matter would be:

  • Differentiation between decimal128/decimal256
  • Any parametrized type (e.g. datetime types with different units / timezones, container types with different value types)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?

its still unclear to me why this is needed in the general case. the way it is presented here suggests this is something i access on my dtype object, so in the DatetimeTZDtype case i would check dtype.physical_type.tz instead of dtype.tz? or i could imagine standardizing the dtype constructors so you would write dtype = pd.datetime(physical_type=something_that_specifices_tz) instead of dtype=pd.datetime(tz=tz). Neither of these seems great, so i feel like I'm missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question about the decimal case. If coming from a database world its not uncommon to have to pick between different integer sizes, but I think rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation...but yea I'm not sure the best way to approach that

Doing something like pd.datetime(physical_type=something_that_specifies_tz) is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be pd.datetime(tz=tz) and a user should not care if we use pandas or pyarrow underneath that.

Using #58315 as an example again, if we used the ser.dt.date accessor on a series with a logical type of pd.datetime(), then the user should expect back a pd.date() back. That gives us a lot of flexibility as library authors - maybe we still want pd.datetime() to be backed by our datetime extension by default, but could then use pyarrow's date implementation to back pd.date() by default rather than having to build that from scratch

In today's world we are either asking users to explicitly use pa.timestamp() to get back a true date type from ser.dt.date, or they use the traditional pandas extension type and get back dtype=object when they use the accessor

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PDEP suggests the 99% use case would be pd.datetime(tz=tz)

This seems reasonable, but then im not clear on why physical_type is in the base class.

but could then use pyarrow's date implementation to back pd.date() by default

Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.

In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

BTW im nitpicking a bunch, but im generally positive on the central idea here.

The feedback is very helpful and its good to flesh this conversation out so no worries. Much appreciated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit for all cases. For example for our own masked Int64Dtype, what's the physical type? np.int64? (but that's only for the data array, while there is also a mask array)
And is this physical type then always either a numpy type or pyarrow type?

Copy link
Member Author

@WillAyd WillAyd Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would consider the physical type in that case to be the pd.core.arrays.integer.Int64Dtype but maybe we should add terminology to this PDEP?

In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control how its data buffer and mask are laid out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

If i have a non-pyarrow pd.datetime Series and i do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then i have mixed semantics.

In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

I think it should be as simple as:

>>> ser = pd.Series(dtype=pd.datetime(unit="ns", tz="US/Pacific")
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

Or

>>> ser = pd.Series(dtype=pd.DatetimeDtype(unit="ns", tz="US/Pacific")
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.

When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say ser.dt.date() they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end user

...

@property
def missing_value_marker -> pd.NA|np.nan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in most places we call this "na_value". Is there a reason to use a different name here? is pd.NA|np.nan here just for the example's sake or is that a restriction you want to impose? What about None? Could a 3rd party EA author do something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - happy to change the name. I would rather not expand the scope beyond pd.NA or np.nan. I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable. With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable.

This is one of the main comments I was going to give: IMO this should not be configurable, even though it is true we live currently in this non-ideal world of multiple missing sentinels.
And it might makes things also more complex / dependent on other discussions, but I think for designing those new logical types we should somewhat assume we will in parallel also do a PDEP on converging on a single NA value for missing sentinel (it will actually makes things simpler for the design in this PDEP, it's just a bit a mess to discuss / implement all that on the short term because of the interdependencies).

Even if on the short term we have logical types that have either NA or NaN or NaT, I don't think that it should be something that the user can specify like in your example below (pd.string(missing_value_marker=np.nan)). I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker). So we can still have this attribute (i.e. the current na_value already does this), but it's just the query the information, not to set by the user.

With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

If we want to move to those new logical types, we will anyway have a period where we have both the current types and the new types. So if we in parallel decide on a NA PDEP, it's also an option that those new logical types just all use NA. Whether that is feasible of course depends on the timeline of both developments (like ideally the string logical dtype for 3.0 would also already have used NA, but it does not because otherwise we could not yet introduce it by default right now)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker).

I agree with this comment, but we have already implemented this in a way where we disconnect the "backend" from the NA marker with types like "pyarrow_numpy". I agree users ideally won't be touching this field (maybe it can be a read-only property?).

If we removed this altogether I think we would lose the ability to support a transition period for dtypes like pyarrow_numpy and the python_numpy string type you are working on, but I may be missing the bigger picture on that.

I would definitely support another PDEP to unify our missing value handling, but would be wary of trying to tackle two large PDEPs at once. With this field being a part of the logical type proposal, I think it helps us support all of the existing types we have already implemented and give us better flexibility to control / deprecate the various NA markers we allow today

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're certainly right that allowing both makes a transition easier, as is what we are doing with the string dtype. And depending on the timelines of both changes (logical types / NA), it might indeed be needed for other dtypes as well.

So I think my main point is that we shouldn't show this as a "parameter" of the dtype construction, but just as a property of a certain dtype instance.

Essentially what we did for the string dtype was to call this a different "backend" (although multiple string dtype instances with different null value might both be using pyarrow or numpy, but the "backend" argument maps to the concrete EA Array class that is being used, and those are actually different), so it is tied to a specific implementation, and not user-specified. And although this is very ugly with the different strange backend names, I think this is a more future-proof option if we think longer term the NA value should not be user configurable.

It might be nicer / clearer to show this in the StringDtype repr as a separate property instead of being merged in the backend name (so eg StringDtype(storage="python", na_value=NA) / StringDtype(storage="python", na_value=NaN) instead of StringDtype(storage="python") / StringDtype(storage="python_numpy") (because what does "python_numpy" mean to a user.. ?).
But at the same time I also wouldn't want to embed too much the idea that na_value is user configurable, it's just that for the transition period we need a way to be able to specify both versions of the dtype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that na_value is not for users of the pandas library, but for developers of the pandas library. I think it should be read-only, and possibly a "hidden" property (prefix with an underscore)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely support both of what @jorisvandenbossche and @Dr-Irv are saying here. I think where I am getting tripped up is how we allow users today to swap over to this paradigm.

If someone had a code base where they do ser = pd.Series(..., dtype="string[pyarrow_numpy]") and elsewhere maybe just ser = pd.Series(..., dtype="string[pyarrow]"), how do we think they can port to this new design? Or do we not think that is valuable?

To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like:

```python
class BaseType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would this then get mixed into ExtensionDtype? if not, what is this used for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it depends on whether we as a team want to go down the path of offering constructors like pd.string() or pd.StringDtype(); with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I think through it more it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas

Maybe this should be renamed to BaseLogicalType to clairfy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then im unclear on what we're doing with it? and if EA authors are supposed to implement something for it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the .dtype that users interact with? That's the logical type?

Yea that's where my head is at - essentially they would see / interact with a subclass of the BaseType shown here

For external extension dtypes, I think it will be very hard to extend this concept of logical types to multiple external libraries, like your example of two independent external libraries implementing the same logical type.

Yea this is very tricky. I don't know what the best answer is but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things

I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.

So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your point 1) would definitely need to be developed further. Off the top of my head maybe an extension array would need to define an attribute like logical_type going forward. If the logical type if "foo", we would let that be constructed using pd.foo or pd.FooDtype

I think the value to 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There are no shortage of string implementations that can be surmised, maybe there is some value in having an EA register yet another "string" logical type?

- Signed Integer
- Unsigned Integer
- Floating Point
- Fixed Point
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this correspond to pyarrow's decimal128?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either decimal128 or decimal256

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you call it "Decimal", is there possible confusion with the python standard library decimal ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm even if there is I'm not sure that would matter. Do you think its that different today from the distinction between a python built-in int versus a "int64"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, because the python decimal type lets you specify any number of digits for the precision, as well as rounding.

So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 units of precision and 256 goes up to 76.

Taking a step back from Python/Arrow I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle but generally I think calling it Decimal as a logical type signals the intent of the type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One can argue the same for "integer" which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the decimal stdlib module, there a few more differences, but regardless from that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because of python using that name, regardless of some subtle differences)


## String Type Arguments

This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types. Those string arguments are:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

off the top of my head, this would mean deprecating M8[unit], m8[unit], timedelta64[unit], period[freq], interval, interval[subtype, closed], category. (the first three of which are supported by np.dtype(...)). Just checking that these are all intentional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely would want to deprecate using the NumPy types since those would be circumventing the logical type system. I am -0.5 on continuing to allow timedelta64 because that also refers directly to the extension array implementation name rather than a logical type.

Period seems logical to continue supporting. I don't think I've ever constructed an interval that way before so I'll withhold an opinion on that for now, but generally these are offered for legacy support and not the ideal way of creating types going forward

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely would want to deprecate using the NumPy types

See also some of the discussion in #58243.

But IMO it's a very hard step to actually deprecate specifying numpy dtypes, given how widespread this usage is. And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
In any case, if we would deprecate that, it would need to go very slowly, and we should only do that a while after introducing the new pandas types.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.

The benefit would be having some consistency in the type constructors we allow and it would decouple the user focus on something have to be NumPy.

I agree it would be a very long deprecation, but on the flip side I don't think a user really believes they are subscribing into any particular set of features when they ask for a np.int64 data type. They really just want a 64 bit integer.

Especially with integer types being zero copy there is very little benefit to asking for a np.int64, and I would argue that is even counterproductive with the possibility of missing values being introduced

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm leaning towards thinking deprecating these would cause a lot of churn for users without much clear benefit. To the extent that you want to encourage users to use different patterns, I'm fine with that.

Copy link
Member Author

@WillAyd WillAyd Apr 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though; in such a case we are just adding more type construction options, which is a growing problem in our current state that this PDEP is trying to solve

There also is a clear benefit to pd.int64 over np.int64, namely the former has flexibility to better handle missing values whereas the latter locks you into numpy semantics

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though;

In my mind, we wouldn't allow them to create a Series with a "physical type", we would just accept np.int64 as input for the dtype argument, and specifying np.int64 or pd.int64() would give the exact same result: a Series with whatever is the default int64 dtype at that point (right now that is the numpy based int64, in the future maybe a numpy masked int64 or pyarrow int64).

It's certainly true that this doesn't help with the "too many ways" to specify a dtype, and this PDEP would add yet another way, but at least a way that is consistent across all dtypes, and which ensures there is a clear mapping of whathever random input to the resulting dtype.

For example, np.int64 is actually not even a dtype but a scalar constructor (np.dtype("int64") is the dtype), but conveniently accepted as a dtype argument in numpy. Just like the python types are also accepted, in addition to strings.
So with the logical types, you can say that (for now) all of np.int64, np.dtype("int64"), "int64", int, "int" all map to the same as pd.int64(), and longer term we could discuss reducing the number of aliases being accepted.

(I didn't do that with this comment .. ;), but personally I think we should maybe better focus first on the actual (logical) dtypes (the new part in this PDEP), and we can still later consider which aliases should be allowed or not. Short term we need to keep them all working anyway)

EDIT: numpy has very extensive documentation about which values are exactly accepted as dtype descriptors/aliases at https://numpy.org/devdocs/reference/arrays.dtypes.html#specifying-and-constructing-data-types, we should probably do something similar for pandas at some point

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's fair. Providing compat is going to be a fact of life for this. Ultimately I don't think it would be that hard for a user to swap out np.int64 for the logical constructor (in such an instance it would just be find/replace), but yea I agree that can be something cleaned up over time. Its not critical to execution of this PDEP


## Bridging Type Systems

An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"constructs" here seems weird. do you mean when two objects with different storages interact? i.e. ser_pyarrow + ser_numpy? If so, I agree with the ordering below, and note that it can be governed by __pandas_priority__.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the ordering as well, but I would be supportive of just raising as well

| string | N/A | pd.StringDtype() | pd.ArrowDtype(pa.string()) |
| datetime | N/A | ??? | pd.ArrowDtype(pa.timestamp("us")) |

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref #24662

- Period
- Binary
- String
- Dictionary
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Categorical instead? Because "dictionary" in Arrow terms is rather the physical type, not logical (like the string column could be dict encoded to save space, but still represent strings). While categorical (like pandas does it now) has specific semantics (i.e. fixed set of allowed values)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong preference, but Dictionary has the advantage of being more generic. With Categorical as a logical type I still think we'd end up offering a separate Dictionary logical type, which may or may not be a bad thing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be the point of a Dictionary logical type in addition to Categorical? When would one use that?

Suppose for the case of having dict encoded string values, either you just want to see those as strings, and then this is the String logical type, or either they define a strict set of categories and then this is the Categorical logical type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a logical Dictionary type could offer support for things like Python dictionaries or JSON structures that would seem an odd fit through the lens of a Categorical type

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, we are talking about completely different things then ;) I was assuming you were using "dictionary" in the Arrow sense, but so you are using it in the python sense (in Arrow that is called a map type).

I am fine with in the future having a map/dictionary type (I also don't think this list necessarily needs to be exhaustive, we can also focus on those types for which we have somewhat support nowadays).
But then I do think that "Categorical" is missing in the list above (and now using the term as our existing CategoricalDtype)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK makes sense. Yea I agree let's add Categorical separately then

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW i also thought this referred to a Categorical-like dtype.

there is a JSONDtype in the extension tests (im guessing you implemented it long ago) that IIRC has a ton of xfails. id suggest getting that closer to fully-working before committing ourselves to first-class support for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear I don't think we want to offer a first class logical JSON type, but a user can construct that from a combination of the list / dictionary types

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote, I think in other systems JSON is typically physically stored as a string (so that's a bit different from what the current test JSONDtype does)

@jbrockmendel
Copy link
Member

Some big picture thoughts before I call it a day:

  1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

  2. The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases (xref DEPR: rename 'dtype_backend' #58214). Are there cases other than series.dt.date where this comes up? If not, drop that example to avoid getting bogged down.

  3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.

  4. The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.

  5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64).
    a) The easy part of this would be mapping the pd.int64(...) to an existing dtype.
    b) The hard part would be everything else
    i) does obj.dtype give an "old" dtype or something new? what about series.array? Does this entail another abstraction layer?
    ii) efficient/idiomatic checks for e.g. pd.int64?
    iii) Depending on how this relates to ExtensionDtype, does construct_array_type need to become a non-classmethod?

@WillAyd
Copy link
Member Author

WillAyd commented Apr 29, 2024

Cool thanks again for the thorough feedback. Responding to each question...

  1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

Noted and yea I was thinking this as well as I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312

2. The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases

This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types but I am not aware of one where you as an end user would be responsible for picking that

3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.

Definitely - noted for next revision. But at a high level I am of the opinion that logical types have nothing to do with data buffers, null masks, or algorithm implementations. Continuing with the database analogy, a logical query plan would say things like "Aggregate these types" or "Join these types". A physical query plan by contrast would say "use this algorithm for summation" or "use this algorithm for equality". An extension type acts more like the latter case

4. The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.

I certainly dislike these too, but I'm not sure how our current type system could be ported to a logical type system without them. With our many string implementations being the main motivating case, I think this metadata is required for a logical type instance to expose so other internal methods know what they are dealing with. I would definitely at least love for these to be private, but it comes down to how much control we want to give users to pick "string" versus "string[pyarrow]" versus "string[pyarrow_python]" and whatever other iterations they can use today

5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64)

To be clear I would sincerely hope a user specifying the backend is an exceptional case; that is only allowed because our current type system allows / quasi-expects it. The logical type is a stepping stone to untangle user requirements from implementation details

@jbrockmendel
Copy link
Member

This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types but I am not aware of one where you as an end user would be responsible for picking that

In the database example do the different implementations have different behaviors? It strikes me as insanesilly to say users don't need to know/think/care about things that affect behavior.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 29, 2024

  1. Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.

Noted and yea I was thinking this as well as I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312

I agree there are a few parts of the Bridging section that are probably not needed and can be trimmed (eg the fact about following the C standard rules on casting / conversion), but on the other hand the text the will also have to be expanded significantly in other areas. For example all the questions Brock listed in the last item in his comment above (#58455 (comment)) will need answering. Same for capturing a summary of some of the discussion points we have been talking about (we already generated a lot comments in two days, we can't expect everyone to read all that to have a good idea of the discussion)

In the database example do the different implementations have different behaviors? It strikes me as insanesilly to say users don't need to know/think/care about things that affect behavior.

Setting the different null handling aside for a moment (because we know that is a big difference right now, but one we might want to do something about), I think that most reports about different behaviour should be considered as bug reports / limitations of the current implementation, and not inherent differences.
Some examples that have been referenced in the discussion (#58321, #58307, #58315, #53154) all seem to fit in that bucket to me.

(of course there will always be some differences, like small differences because of numeric stability in floating point algorithms, or different capitalization of the German ß (apache/arrow#34599), and it will always be a somewhat subjective line between implementation details versus clear and intentional behavioural differences. But the list of items linked above are all clear things that could be fixed on our side to ensure we provide a consistent behaviour, not inherent differences. It's our choice whether we want to provide a consistent interface or not)

pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan)
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view())
```
Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a dtype, I'd prefer having Dtype in the name. That favors the class based approach. But the factory function approach could be used as well, but I'd prefer pd.stringDtype() over pd.string() to make it clear that it's a dtype.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool - I think @jorisvandenbossche has the same preference. So happy to start focus on that pattern more.

The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont have a strong preference, but do think it will make the discussion easier to focus on just one variant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I'll stick with that in the next revision. We can always revisit if that doesn't work out for whatever reason. Thanks for the feedback all

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, my personal preference would be to go with the shorter functional constructors (pd.string(), pd.int64(), etc).
It's easier to use, less boilerplate, generally more similar to other libraries (this was my impression, but looking up a few examples, that is not necessarily true), and ties less into the exact Dtype class (in case two variants of one logical type want to use a different dtype class, although I am not sure this is necessarily something we want, though)

But agreed that for now, I think you can just use one throughout the document to explain everything, and then I would add one section on this topic mentioning some options and why which choice was made (and we can still further bikeshed on the exact syntax, but that feels like an actual bikeshed separate from the core ideas of the PDEP)

(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed topic in Basel last year)

The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Well, we still need to discuss what those logical type constructors would actually return ;) In my mind, it could continue to return our extension types (as I wouldn't necessarily distinguish user-facing extension types and logical types), and then pd.Int64Dtype() would not necessarily have to change return value .. (except that it might gain additional keyword to e.g. get the pyarrow-backed integer dtype)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed topic in Basel last year)

That works for me. I'm not a big fan of continually adding things to the top level pandas namespace.


The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels.

## PDEP-11 History
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## PDEP-11 History
## PDEP-13 History

pd.Series(dtype=pd.list(value_type=pd.string()))
```

## Bridging Type Systems
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to address if people can convert between two physical types that back the same logical type?

below, you discuss where operations between 2 objects of different physical types, but same logical type, cause a priority in the result. But what if users want to convert from one physical type to another?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great callout - I assume you have in mind something like adding a logical int + float.

I think the same rules should still apply, i.e. if one is physically a pyarrow int and the other a numpy float you should end up with logical float type backed by a physical pyarrow float

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or if I have a numpy backed string and I want to convert to a pyarrow backed string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those would both be under the logical string type, so if you tried to concatenate them together you would get a pyarrow string per the rules outlined here. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I'm saying that let's say you know that you have a Series of logical type string, and you also know that it is backed by numpy string, and you want to convert that Series to have a physical type of arrow. Don't you need a public API that lets you convert from one physical backing to another?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second is that you want an Index with timestamp[pyarrow] dtype to be (or just behave like) a DatetimeIndex.

Yea if we move to a place where the DatetimeIndex is backed by the logical datetime type, then it becomes an implementation detail for us to control how the two different physical types need to fulfill the requirements of a logical datetime type. I think since the pyarrow / NumPy datetime types are zero-copy save for NaT handling the internal development for that should be a lot less burdensome than having a separate DatetimeIndex[pyarrow]

Copy link
Member Author

@WillAyd WillAyd Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you would want to implement a new date dtype with numpy semantics?

No - there would be no need at all for that. In this particular case, if the logical requirement was to go from a datetime -> date, and physically our array happened to be backed by a NumPy array, we internally would go NumPy datetime -> pa.timestamp -> pa.date, with the pa.date being exposed as a pd.date type; if the array was already backed by pyarrow we could skip the first step

This can get tricky if/when PDEP-10 is reverted.

I think even if PDEP-10 got reverted it would limit the logical types we offer (we'd essentially scrap date, list, dictionary, struct, etc...anything where only pyarrow implements something) until pyarrow becomes a hard dependency again

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - there would be no need at all for that [...]

We're definitely talking past each other, because AFAICT the suggestion here directly contradicts discussion in #58315. Let's put a pin in this until the call on Wednesday and see if we can sort it out then.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing this part of the discussion from the call: the communication failure was in 1) what "mixed semantics" I was concerned about and 2) what Will meant by not needing to implement anything new.

Consider

dti = pd.date_range("2016-01-01", periods=3)
ser = pd.Series(dti)
ser.iloc[1] = pd.NaT
df = ser.to_frame("datetime")
df["date"] = df["datetime"].dt.date

Now suppose that df["date"] is backed (perhaps indirectly) by a pyarrow array. When I say the columns have "mixed semantics", I am referring to the propagation of nulls in comparisons

mask1 = df["date"] == dti[0].date()
mask2 = df["date"] != dti[0].date()

In this scenario mask1 and mask2 are both nullable and their second entry is pd.NA. Using these masks for indexing gives surprising results (raising ATM, different if the Ice Cream Agreement goes into effect) to users expecting numpy semantics. A user who has not opted in to null propagation has some columns that behave this way and some that do not. Avoiding that scenario is why we implemented "string[pyarrow_numpy]" to have arrow storage with numpy semantics.

OK, all of that is what I had failed to communicate. What Will was saying when he said (paraphrasing) "we wouldn't need to implement another dtype", that didn't exclude a thin wrapper to retain numpy propagation semantics with pyarrow storage.

@WillAyd can you confirm that the above is a valid summary of what we got sorted out at the end of the call?

Copy link
Member Author

@WillAyd WillAyd May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea that's correct and a nice summary. This is why the BaseType in the PDEP has the missing_value_marker / na_value - it explicitly decouples the NA handling from the physical type. It specifically allows the logical type to cover the functionality of string[pyarrow_numpy] as you described above, and as we talked about on the call theoretically also allows you to have a logical date type which uses pyarrow behind the scenes but can present pd.NaT as the na_value to the end user

I think this PDEP could help achieve more consistent NA handling over time since it decouples the user expectations from a "backend", but actually propogating one NA marker consistently is left to a separate PDEP

@mroeschke mroeschke added the PDEP pandas enhancement proposal label Apr 29, 2024

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).

The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)).
The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though pandas claim no real support for them (see [#5764](https://github.com/pandas-dev/pandas/issues/57645)).


``na_marker`` is expected to be read-only (see next section). For advanced users that have a particular need for a storage type, they may be able to construct the data type via ``pd.StringDtype(data_manager=np)`` to assert NumPy managed storage. While the PDEP allows constructing in this fashion, operations against that data make no guarantees that they will respect the storage backend and are free to convert to whichever storage the internals of pandas considers optimal (Arrow will typically be preferred).

### Missing Value Handling
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel wanted to ping this off you. When we talked I said would stick with np.nan as the missing value marker, but since we are reusing the pd.StringDtype() in this version of the proposal I don't think we should move that away from pd.NA

Its not the default but maybe this pd.na_marker="legacy" serves as a good middle ground?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there's a global option, it should go through pd.set_option/get_option

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why data_manager instead of storage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only reason I wanted to include the word data is because we are strictly referring to the data buffer and nothing else. I don't have a strong preference on the terminology of this though

| pd.IntXXType() | pd.NA | np.nan |
| pd.FloatXXType() | pd.NA | np.nan |
| pd.StringDtype() | pd.NA | np.nan |
| pd.DatetimeType() | pd.NA | pd.NaT |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Period, timedelta64


Missing value handling is a tricky area as developers are split between pd.NA semantics versus np.nan, and the transition path from one to the other is not always clear.

Because this PDEP proposes reuse of the existing pandas extension type system, the default missing value marker will consistently be ``pd.NA``. However, to help with backwards compatibility for users that heavily rely on the equality semantics of np.nan, an option of ``pd.na_marker = "legacy"`` can be set. This would mean that the missing value indicator for logical types would be:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel like you're not engaging with what joris and i tried to explain about "you need to do it all at once" on Wednesday's call.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PDEP pandas enhancement proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants