PDEP-13: The pandas Logical Type System #58455

161 changes: 161 additions & 0 deletions web/pandas/pdeps/0013-logical-type-system.md
# PDEP-13: The pandas Logical Type System

- Created: 27 Apr 2024
- Status: Draft
- Discussion: [#58141](https://github.com/pandas-dev/pandas/issues/58141)
- Author: [Will Ayd](https://github.com/willayd)
- Revision: 1

## Abstract

This PDEP proposes a logical type system for pandas to abstract underlying library differences from end users, clarify the scope of pandas type support, and give pandas developers more flexibility to manage the implementation of types.

## Background

When pandas was originally built, the data types that it exposed were a subset of the NumPy type system. Starting in version 0.23.0, pandas introduced Extension types, which it also began using internally to back arrays instead of relying exclusively upon NumPy. Over the course of the 1.x releases, pandas began using pyarrow for string storage, and in version 1.5.0 it introduced the high-level ``pd.ArrowDtype`` wrapper.

While these new type systems have brought about many great features, they have surfaced three major problems. The first is that we put the onus on users to understand the differences between the physical type implementations. Consider the many ways pandas allows you to create a "string":

```python
dtype=object
dtype=str
dtype="string"
dtype=pd.StringDtype()
dtype=pd.StringDtype("pyarrow")
dtype="string[pyarrow]"
dtype="string[pyarrow_numpy]"
dtype=pd.ArrowDtype(pa.string())
```

Keeping track of all of these iterations and their subtle differences is difficult even for [core maintainers](https://github.com/pandas-dev/pandas/issues/58321).

The second problem is that the conventions for constructing types from a given type backend are inconsistent. Let's review string aliases used to construct certain types:

| logical type | NumPy   | pandas extension | pyarrow                  |
|--------------|---------|------------------|--------------------------|
| int          | "int64" | "Int64"          | "int64[pyarrow]"         |
| string       | N/A     | "string"         | N/A                      |
| datetime     | N/A     | "datetime64[us]" | "timestamp[us][pyarrow]" |

"string[pyarrow]" is excluded from the above table because it is misleading; while "int64[pyarrow]" definitely gives you a pyarrow backed string, "string[pyarrow]" gives you a pandas extension array which itself then uses pyarrow, which can introduce behavior differences (see [issue 58321](https://github.com/pandas-dev/pandas/issues/58321)).

If you wanted to try and be more explicit about using pyarrow, you could use the ``pd.ArrowDtype`` wrapper. But this unfortunately exposes gaps when trying to use that pattern across all backends:

| logical type | NumPy    | pandas extension | pyarrow                           |
|--------------|----------|------------------|-----------------------------------|
| int          | np.int64 | pd.Int64Dtype()  | pd.ArrowDtype(pa.int64())         |
| string       | N/A      | pd.StringDtype() | pd.ArrowDtype(pa.string())        |
| datetime     | N/A      | ???              | pd.ArrowDtype(pa.timestamp("us")) |

It would stand to reason in this approach that you could use a ``pd.DatetimeDtype()`` but no such type exists (there is a ``pd.DatetimeTZDtype`` which requires a timezone).
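
For example, as of pandas 2.x:

```python
import pandas as pd

pd.DatetimeTZDtype(unit="ns", tz="UTC")  # exists, but the timezone is mandatory
# pd.DatetimeTZDtype(unit="ns")          # raises: a 'tz' argument is required
# pd.DatetimeDtype()                     # AttributeError: no such class exists
```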
Member:

xref #24662


The third issue is that the extent to which pandas may support any given type is unclear. Issue [#58307](https://github.com/pandas-dev/pandas/issues/58307) highlights one example. It would stand to reason that you could interchangeably use a pandas datetime64 and a pyarrow timestamp, but that is not always true. Another common example is the use of NumPy fixed length strings, which users commonly try to use even though we claim no real support for them (see [#57645](https://github.com/pandas-dev/pandas/issues/57645)).
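
A small illustration of the fixed-length string case, assuming current behavior - the requested width is silently discarded rather than enforced:

```python
import numpy as np
import pandas as pd

arr = np.array(["a", "bb"], dtype="U2")  # NumPy fixed-width unicode strings
pd.Series(arr).dtype                     # dtype('O'): the fixed width is not preserved
```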
Member:

Suggested change: "... even though we claim no real support for them" → "... even though pandas claims no real support for them"


## Assessing the Current Type System(s)

A best-effort visualization of the current type system(s), covering the types that we currently "support" or reasonably may want to, is shown [in this comment](https://github.com/pandas-dev/pandas/issues/58141#issuecomment-2047763186). Note that this does not include the ["pyarrow_numpy"](https://github.com/pandas-dev/pandas/pull/58451) string data type or the string data type that uses the NumPy 2.0 variable length string data type (see [comment](https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080798081)), as they are under active discussion.

## Proposal

Derived from the hierarchical visual in the previous section, this PDEP proposes that pandas support at least all of the following _logical_ types, excluding any type widths for brevity:
Member:

IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?

Member Author (@WillAyd, 28 Apr 2024):

Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.

In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of pd.ArrowDtype(pa.date32()), pd.ArrowDtype(pa.timestamp(<unit>)) and datetime64[unit] since we don't offer any logical abstraction.

In the model of this PDEP, we would offer something to the effect of pd.date() and pd.datetime(), abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315


- Signed Integer
- Unsigned Integer
- Floating Point
- Fixed Point
Member:

does this correspond to pyarrow's decimal128?

Member Author:

Either decimal128 or decimal256

Member:

Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)

Member Author:

Sounds good to me

Contributor:

If you call it "Decimal", is there possible confusion with the python standard library decimal ?

Member Author:

Hmm even if there is I'm not sure that would matter. Do you think it's that different today from the distinction between a Python built-in int and an "int64"?

Contributor:

yes, because the python decimal type lets you specify any number of digits for the precision, as well as rounding.

So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.

Member Author:

Yea the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 digits of precision and decimal256 goes up to 76.

Taking a step back from Python/Arrow I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle but generally I think calling it Decimal as a logical type signals the intent of the type

Member:

One can argue the same for "integer", which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the decimal stdlib module, there are a few more differences, but regardless of that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because Python uses that name, regardless of some subtle differences)

- Boolean
- Date
- Datetime
- Duration
- Interval
- Period
- Binary
- String
- Dictionary
Member:

Categorical instead? Because "dictionary" in Arrow terms is rather the physical type, not logical (like the string column could be dict encoded to save space, but still represent strings). While categorical (like pandas does it now) has specific semantics (i.e. fixed set of allowed values)

Member Author:

I don't have a strong preference, but Dictionary has the advantage of being more generic. With Categorical as a logical type I still think we'd end up offering a separate Dictionary logical type, which may or may not be a bad thing

Member:

What would be the point of a Dictionary logical type in addition to Categorical? When would one use that?

Suppose for the case of having dict encoded string values, either you just want to see those as strings, and then this is the String logical type, or either they define a strict set of categories and then this is the Categorical logical type.

Member Author:

I think a logical Dictionary type could offer support for things like Python dictionaries or JSON structures that would seem an odd fit through the lens of a Categorical type

Member:

Ah, OK, we are talking about completely different things then ;) I was assuming you were using "dictionary" in the Arrow sense, but so you are using it in the python sense (in Arrow that is called a map type).

I am fine with in the future having a map/dictionary type (I also don't think this list necessarily needs to be exhaustive; we can also focus on those types for which we have some measure of support nowadays).
But then I do think that "Categorical" is missing in the list above (and now using the term as our existing CategoricalDtype)

Member Author:

Ah OK makes sense. Yea I agree let's add Categorical separately then

Member:

FWIW I also thought this referred to a Categorical-like dtype.

there is a JSONDtype in the extension tests (I'm guessing you implemented it long ago) that IIRC has a ton of xfails. I'd suggest getting that closer to fully-working before committing ourselves to first-class support for it.

Member Author:

To be clear I don't think we want to offer a first class logical JSON type, but a user can construct that from a combination of the list / dictionary types

Member:

Sidenote, I think in other systems JSON is typically physically stored as a string (so that's a bit different from what the current test JSONDtype does)

- List
- Struct

To ensure we maintain all of the current functionality of our existing type system(s), a base type structure would need to look something like:

```python
from typing import Literal


class BaseType:
Member:

would this then get mixed into ExtensionDtype? if not, what is this used for?

Member Author:

I suppose it depends on whether we as a team want to go down the path of offering constructors like pd.string() or pd.StringDtype(); with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical

Member Author:

As I think through it more it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas

Maybe this should be renamed to BaseLogicalType to clarify

Member:

then I'm unclear on what we're doing with it? and if EA authors are supposed to implement something for it

Member Author:

That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think

Member Author:

> What is the .dtype that users interact with? That's the logical type?

Yea that's where my head is at - essentially they would see / interact with a subclass of the BaseType shown here

> For external extension dtypes, I think it will be very hard to extend this concept of logical types to multiple external libraries, like your example of two independent external libraries implementing the same logical type.

Yea this is very tricky. I don't know what the best answer is but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?)

Member:

> I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things

I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)

Member Author:

Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

Member:

> Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these

That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.

So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.

Member Author:

Your point 1) would definitely need to be developed further. Off the top of my head maybe an extension array would need to define an attribute like logical_type going forward. If the logical type is "foo", we would let that be constructed using pd.foo or pd.FooDtype

I think the value to 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There is no shortage of string implementations that can be surmised; maybe there is some value in having an EA register yet another "string" logical type?


    @property
    def dtype_backend(self) -> Literal["pandas", "numpy", "pyarrow"]:
        """
        Which library is responsible for the array implementation.
        """
        ...

    @property
    def physical_type(self):
Member:

are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?

Member Author:

Good question. Other areas where I think it would matter would be:

- Differentiation between decimal128/decimal256
- Any parametrized type (e.g. datetime types with different units / timezones, container types with different value types)

Member:

Makes sense, thanks. Is there a reason to include the decimalN case but not (int|float|uint)N? Is that distinct from itemsize?

It's still unclear to me why this is needed in the general case. The way it is presented here suggests this is something I access on my dtype object, so in the DatetimeTZDtype case I would check dtype.physical_type.tz instead of dtype.tz? Or I could imagine standardizing the dtype constructors so you would write dtype = pd.datetime(physical_type=something_that_specifies_tz) instead of dtype=pd.datetime(tz=tz). Neither of these seems great, so I feel like I'm missing something.

Member Author:

That's a good question about the decimal case. If coming from a database world its not uncommon to have to pick between different integer sizes, but I think rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation...but yea I'm not sure the best way to approach that

Doing something like pd.datetime(physical_type=something_that_specifies_tz) is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be pd.datetime(tz=tz) and a user should not care if we use pandas or pyarrow underneath that.

Using #58315 as an example again, if we used the ser.dt.date accessor on a series with a logical type of pd.datetime(), then the user should expect a pd.date() back. That gives us a lot of flexibility as library authors - maybe we still want pd.datetime() to be backed by our datetime extension by default, but could then use pyarrow's date implementation to back pd.date() by default rather than having to build that from scratch

In today's world we are either asking users to explicitly use pa.timestamp() to get back a true date type from ser.dt.date, or they use the traditional pandas extension type and get back dtype=object when they use the accessor

Member:

> This PDEP suggests the 99% use case would be pd.datetime(tz=tz)

This seems reasonable, but then I'm not clear on why physical_type is in the base class.

> but could then use pyarrow's date implementation to back pd.date() by default

Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?

Member Author:

> I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

> If I have a non-pyarrow pd.datetime Series and I do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then I have mixed semantics.

In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

> BTW I'm nitpicking a bunch, but I'm generally positive on the central idea here.

The feedback is very helpful and it's good to flesh this conversation out so no worries. Much appreciated

Member:

While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit for all cases. For example for our own masked Int64Dtype, what's the physical type? np.int64? (but that's only for the data array, while there is also a mask array)
And is this physical type then always either a numpy type or pyarrow type?

Member Author (@WillAyd, 28 Apr 2024):

I would consider the physical type in that case to be the pd.core.arrays.integer.Int64Dtype but maybe we should add terminology to this PDEP?

In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control over how its data buffer and mask are laid out

Member:

> I think it would be easier to discuss if there were more clarity on what this base class is used for. With that very big caveat, my impression is that the relevant keyword/attribute will vary between string/decimal/datetime/whatever and this doesn't belong in the base class at all.

> Are you referring to the physical_type keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a pa.string(), pa.large_string(), object, etc...The "datetime" type could be physically implemented as a pa.timestamp or the existing pandas extension datetime. Right now a decimal would only be backed by either a pa.decimal128 or pa.decimal256

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

> If I have a non-pyarrow pd.datetime Series and I do ser.dt.date and get a pd.date Series that under the hood is date32[pyarrow] then I have mixed semantics.

> In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a pa.timestamp or our own datetime type, and a date can only be implemented as a pa.date32, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type

You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

Member Author:

The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, pd.DatetimeTZDtype(unit="ns", tz="US/Pacific") would also be helpful.

I think it should be as simple as:

>>> ser = pd.Series(dtype=pd.datetime(unit="ns", tz="US/Pacific"))
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

Or

>>> ser = pd.Series(dtype=pd.DatetimeDtype(unit="ns", tz="US/Pacific"))
>>> ser.dtype
pd.DatetimeDtype(unit="ns", tz="US/Pacific")

> You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.

I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.

When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say ser.dt.date() they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end user

"""
How does the backend physically implement this logical type? i.e. our
logical type may be a "string" and we are using pyarrow underneath -
is it a pa.string(), pa.large_string(), pa.string_view() or something else?
"""
...

    @property
    def missing_value_marker(self):  # returns pd.NA or np.nan
Member:

in most places we call this "na_value". Is there a reason to use a different name here? is pd.NA|np.nan here just for the example's sake or is that a restriction you want to impose? What about None? Could a 3rd party EA author do something else?

Member Author:

No - happy to change the name. I would rather not expand the scope beyond pd.NA or np.nan. I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable. With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

Member:

> I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable.

This is one of the main comments I was going to give: IMO this should not be configurable, even though it is true we live currently in this non-ideal world of multiple missing sentinels.
And it might make things more complex / dependent on other discussions, but I think for designing those new logical types we should somewhat assume we will in parallel also do a PDEP on converging on a single NA value for missing sentinel (it will actually make things simpler for the design in this PDEP; it's just a bit of a mess to discuss / implement all that on the short term because of the interdependencies).

Even if on the short term we have logical types that have either NA or NaN or NaT, I don't think that it should be something that the user can specify like in your example below (pd.string(missing_value_marker=np.nan)). I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker). So we can still have this attribute (i.e. the current na_value already does this), but it's just the query the information, not to set by the user.

> With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.

If we want to move to those new logical types, we will anyway have a period where we have both the current types and the new types. So if we in parallel decide on a NA PDEP, it's also an option that those new logical types just all use NA. Whether that is feasible of course depends on the timeline of both developments (like ideally the string logical dtype for 3.0 would also already have used NA, but it does not because otherwise we could not yet introduce it by default right now)

Member Author:

> I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker).

I agree with this comment, but we have already implemented this in a way where we disconnect the "backend" from the NA marker with types like "pyarrow_numpy". I agree users ideally won't be touching this field (maybe it can be a read-only property?).

If we removed this altogether I think we would lose the ability to support a transition period for dtypes like pyarrow_numpy and the python_numpy string type you are working on, but I may be missing the bigger picture on that.

I would definitely support another PDEP to unify our missing value handling, but would be wary of trying to tackle two large PDEPs at once. With this field being a part of the logical type proposal, I think it helps us support all of the existing types we have already implemented and give us better flexibility to control / deprecate the various NA markers we allow today

Member:

Yes, you're certainly right that allowing both makes a transition easier, as with what we are doing with the string dtype. And depending on the timelines of both changes (logical types / NA), it might indeed be needed for other dtypes as well.

So I think my main point is that we shouldn't show this as a "parameter" of the dtype construction, but just as a property of a certain dtype instance.

Essentially what we did for the string dtype was to call this a different "backend" (although multiple string dtype instances with different null value might both be using pyarrow or numpy, but the "backend" argument maps to the concrete EA Array class that is being used, and those are actually different), so it is tied to a specific implementation, and not user-specified. And although this is very ugly with the different strange backend names, I think this is a more future-proof option if we think longer term the NA value should not be user configurable.

It might be nicer / clearer to show this in the StringDtype repr as a separate property instead of being merged in the backend name (so eg StringDtype(storage="python", na_value=NA) / StringDtype(storage="python", na_value=NaN) instead of StringDtype(storage="python") / StringDtype(storage="python_numpy") (because what does "python_numpy" mean to a user.. ?).
But at the same time I also wouldn't want to embed too much the idea that na_value is user configurable, it's just that for the transition period we need a way to be able to specify both versions of the dtype.

Contributor:

I think that na_value is not for users of the pandas library, but for developers of the pandas library. I think it should be read-only, and possibly a "hidden" property (prefix with an underscore)

Member Author:

I definitely support both of what @jorisvandenbossche and @Dr-Irv are saying here. I think where I am getting tripped up is how we allow users today to swap over to this paradigm.

If someone had a code base where they do ser = pd.Series(..., dtype="string[pyarrow_numpy]") and elsewhere maybe just ser = pd.Series(..., dtype="string[pyarrow]"), how do we think they can port to this new design? Or do we not think that is valuable?

"""
Sentinel used to denote missing values
"""
...
```
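
To make the interface concrete, here is a purely hypothetical sketch of a logical string type implementing the base; the class name, constructor arguments, and defaults are illustrative assumptions, not an existing or formally proposed API:

```python
import pandas as pd
import pyarrow as pa


class LogicalStringDtype(BaseType):
    """Hypothetical logical "string" type; storage stays an internal detail."""

    def __init__(self, physical_type=None, missing_value_marker=pd.NA):
        # default to pyarrow storage, matching the backend priority proposed below
        self._physical_type = physical_type if physical_type is not None else pa.string()
        self._missing_value_marker = missing_value_marker

    @property
    def dtype_backend(self):
        return "pyarrow" if isinstance(self._physical_type, pa.DataType) else "numpy"

    @property
    def physical_type(self):
        return self._physical_type

    @property
    def missing_value_marker(self):
        return self._missing_value_marker
```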

The theory behind this PDEP is that most users _should not_ care about the physical type that is being used. But if the abstraction the logical type provides is too coarse, a user could at least inspect, and potentially configure, which physical type to use.
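
Under that model, inspection might look like the following; the ``pd.string()`` constructor and the attribute names are hypothetical, per this proposal:

```python
ser = pd.Series(["foo", "bar"], dtype=pd.string())  # hypothetical constructor

ser.dtype.dtype_backend         # e.g. "pyarrow"
ser.dtype.physical_type         # e.g. pa.string()
ser.dtype.missing_value_marker  # e.g. pd.NA
```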

With regards to how we may expose such types to end users, there are two currently recognized proposals. The first would use factory functions to create the logical type, i.e. something like:

```python
pd.Series(["foo", "bar", "baz"], dtype=pd.string() # assumed common case
pd.Series(["foo", "bar", "baz"], dtype=pd.string(missing_value_marker=np.nan)
pd.Series(["foo", "bar", "baz"], dtype=pd.string(physical_type=pa.string_view())
```

Another approach would be to use classes:

```python
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype()
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(missing_value_marker=np.nan)
pd.Series(["foo", "bar", "baz"], dtype=pd.StringDtype(physical_type=pa.string_view())
```
Note that the class-based approach would reuse existing classes like ``pd.StringDtype`` but change their purpose, whereas the factory function would more explicitly be a new approach. This is an area that requires more discussion amongst the team.
Contributor:

Given that this is a dtype, I'd prefer having Dtype in the name. That favors the class based approach. But the factory function approach could be used as well, but I'd prefer pd.stringDtype() over pd.string() to make it clear that it's a dtype.

Member Author:

Cool - I think @jorisvandenbossche has the same preference. So happy to start focusing on that pattern more.

The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Member:

I don't have a strong preference, but do think it will make the discussion easier to focus on just one variant

Member Author:

Alright I'll stick with that in the next revision. We can always revisit if that doesn't work out for whatever reason. Thanks for the feedback all

Member:

To be honest, my personal preference would be to go with the shorter functional constructors (pd.string(), pd.int64(), etc).
It's easier to use, less boilerplate, generally more similar to other libraries (this was my impression, but looking up a few examples, that is not necessarily true), and ties less into the exact Dtype class (in case two variants of one logical type want to use a different dtype class, although I am not sure this is necessarily something we want, though)

But agreed that for now, I think you can just use one throughout the document to explain everything, and then I would add one section on this topic mentioning some options and why which choice was made (and we can still further bikeshed on the exact syntax, but that feels like an actual bikeshed separate from the core ideas of the PDEP)

(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed topic in Basel last year)

> The only concern I have is that we are repurposing something that has long existed. pd.Int64Dtype() gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thing

Well, we still need to discuss what those logical type constructors would actually return ;) In my mind, it could continue to return our extension types (as I wouldn't necessarily distinguish user-facing extension types and logical types), and then pd.Int64Dtype() would not necessarily have to change return value .. (except that it might gain additional keyword to e.g. get the pyarrow-backed integer dtype)

Contributor:

> (@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like pd.dtypes.string(). I seem to remember this also came up when I discussed this topic in Basel last year)

That works for me. I'm not a big fan of continually adding things to the top level pandas namespace.


## String Type Arguments

This PDEP proposes that we maintain only a small subset of string arguments that can be used to construct logical types. Those string arguments are:
Member:

off the top of my head, this would mean deprecating M8[unit], m8[unit], timedelta64[unit], period[freq], interval, interval[subtype, closed], category. (the first three of which are supported by np.dtype(...)). Just checking that these are all intentional.

Member Author:

I definitely would want to deprecate using the NumPy types since those would be circumventing the logical type system. I am -0.5 on continuing to allow timedelta64 because that also refers directly to the extension array implementation name rather than a logical type.

Period seems logical to continue supporting. I don't think I've ever constructed an interval that way before so I'll withhold an opinion on that for now, but generally these are offered for legacy support and not the ideal way of creating types going forward

Member:

> I definitely would want to deprecate using the NumPy types

See also some of the discussion in #58243.

But IMO it's a very hard step to actually deprecate specifying numpy dtypes, given how widespread this usage is. And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
In any case, if we would deprecate that, it would need to go very slowly, and we should only do that a while after introducing the new pandas types.

Member Author:

> And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.

The benefit would be having some consistency in the type constructors we allow, and it would decouple users from the notion that everything has to be NumPy.

I agree it would be a very long deprecation, but on the flip side I don't think a user really believes they are subscribing to any particular set of features when they ask for a np.int64 data type. They really just want a 64 bit integer.

Especially with integer types being zero copy there is very little benefit to asking for a np.int64, and I would argue that is even counterproductive with the possibility of missing values being introduced

Member:

I'm leaning towards thinking deprecating these would cause a lot of churn for users without much clear benefit. To the extent that you want to encourage users to use different patterns, I'm fine with that.

Member Author (@WillAyd, 28 Apr 2024):

Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though; in such a case we are just adding more type construction options, which is a growing problem in our current state that this PDEP is trying to solve

There also is a clear benefit to pd.int64 over np.int64, namely the former has flexibility to better handle missing values whereas the latter locks you into numpy semantics

Member:

> Allowing users to construct physical types like np.int64 alongside logical types like pd.int64 I think defeats the purpose of adding logical types though;

In my mind, we wouldn't allow them to create a Series with a "physical type", we would just accept np.int64 as input for the dtype argument, and specifying np.int64 or pd.int64() would give the exact same result: a Series with whatever is the default int64 dtype at that point (right now that is the numpy based int64, in the future maybe a numpy masked int64 or pyarrow int64).

It's certainly true that this doesn't help with the "too many ways" to specify a dtype, and this PDEP would add yet another way, but at least a way that is consistent across all dtypes, and which ensures there is a clear mapping of whatever random input to the resulting dtype.

For example, np.int64 is actually not even a dtype but a scalar constructor (np.dtype("int64") is the dtype), but conveniently accepted as a dtype argument in numpy. Just like the python types are also accepted, in addition to strings.
So with the logical types, you can say that (for now) all of np.int64, np.dtype("int64"), "int64", int, "int" all map to the same as pd.int64(), and longer term we could discuss reducing the number of aliases being accepted.

(I didn't do that with this comment .. ;), but personally I think we should maybe better focus first on the actual (logical) dtypes (the new part in this PDEP), and we can still later consider which aliases should be allowed or not. Short term we need to keep them all working anyway)

EDIT: numpy has very extensive documentation about which values are exactly accepted as dtype descriptors/aliases at https://numpy.org/devdocs/reference/arrays.dtypes.html#specifying-and-constructing-data-types, we should probably do something similar for pandas at some point

Member Author:

I think that's fair. Providing compat is going to be a fact of life for this. Ultimately I don't think it would be that hard for a user to swap out np.int64 for the logical constructor (in such an instance it would just be find/replace), but yea I agree that can be something cleaned up over time. It's not critical to execution of this PDEP


- intXX
- uintXX
- floatXX
- string
- datetime64[unit]
- datetime64[unit, tz]
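
Under this proposal those aliases would simply resolve to the corresponding logical constructors, e.g. (a hypothetical mapping):

```python
pd.Series([1, 2], dtype="int64")            # equivalent to dtype=pd.int64()
pd.Series([], dtype="datetime64[ns, UTC]")  # equivalent to dtype=pd.datetime(unit="ns", tz="UTC")
```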

However, new code should be encouraged to use the logical constructors outlined previously. Particularly for aggregate types, trying to encode all of the information into a string can become unwieldy. Keyword arguments are preferred instead:

```python
pd.Series(dtype=pd.list(value_type=pd.string()))
```
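
This matters most for nested types; a hypothetical ``pd.struct`` constructor (assumed here for illustration, not an existing API) would be nearly impossible to express legibly as a string alias:

```python
dtype = pd.struct(fields={
    "name": pd.string(),
    "scores": pd.list(value_type=pd.float64()),
    "updated_at": pd.datetime(unit="us", tz="UTC"),
})
pd.Series(dtype=dtype)
```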

## Bridging Type Systems
Contributor:

Do you need to address if people can convert between two physical types that back the same logical type?

below, you discuss where operations between 2 objects of different physical types, but same logical type, cause a priority in the result. But what if users want to convert from one physical type to another?

Member Author:

Great callout - I assume you have in mind something like adding a logical int + float.

I think the same rules should still apply, i.e. if one is physically a pyarrow int and the other a numpy float you should end up with logical float type backed by a physical pyarrow float

Contributor:

Or if I have a numpy backed string and I want to convert to a pyarrow backed string.

Member Author:

Those would both be under the logical string type, so if you tried to concatenate them together you would get a pyarrow string per the rules outlined here. Does that make sense?

Contributor:

No, I'm saying that let's say you know that you have a Series of logical type string, and you also know that it is backed by numpy string, and you want to convert that Series to have a physical type of arrow. Don't you need a public API that lets you convert from one physical backing to another?

Member Author:

> Second is that you want an Index with timestamp[pyarrow] dtype to be (or just behave like) a DatetimeIndex.

Yea if we move to a place where the DatetimeIndex is backed by the logical datetime type, then it becomes an implementation detail for us to control how the two different physical types need to fulfill the requirements of a logical datetime type. I think since the pyarrow / NumPy datetime types are zero-copy save for NaT handling the internal development for that should be a lot less burdensome than having a separate DatetimeIndex[pyarrow]

Member Author (@WillAyd, 29 Apr 2024):

> So you would want to implement a new date dtype with numpy semantics?

No - there would be no need at all for that. In this particular case, if the logical requirement was to go from a datetime -> date, and physically our array happened to be backed by a NumPy array, we internally would go NumPy datetime -> pa.timestamp -> pa.date, with the pa.date being exposed as a pd.date type; if the array was already backed by pyarrow we could skip the first step

> This can get tricky if/when PDEP-10 is reverted.

I think even if PDEP-10 got reverted it would limit the logical types we offer (we'd essentially scrap date, list, dictionary, struct, etc...anything where only pyarrow implements something) until pyarrow becomes a hard dependency again

Member:

> No - there would be no need at all for that [...]

We're definitely talking past each other, because AFAICT the suggestion here directly contradicts discussion in #58315. Let's put a pin in this until the call on Wednesday and see if we can sort it out then.

Member:

Summarizing this part of the discussion from the call: the communication failure was in 1) what "mixed semantics" I was concerned about and 2) what Will meant by not needing to implement anything new.

Consider

dti = pd.date_range("2016-01-01", periods=3)
ser = pd.Series(dti)
ser.iloc[1] = pd.NaT
df = ser.to_frame("datetime")
df["date"] = df["datetime"].dt.date

Now suppose that df["date"] is backed (perhaps indirectly) by a pyarrow array. When I say the columns have "mixed semantics", I am referring to the propagation of nulls in comparisons

mask1 = df["date"] == dti[0].date()
mask2 = df["date"] != dti[0].date()

In this scenario mask1 and mask2 are both nullable and their second entry is pd.NA. Using these masks for indexing gives surprising results (raising ATM, different if the Ice Cream Agreement goes into effect) to users expecting numpy semantics. A user who has not opted in to null propagation has some columns that behave this way and some that do not. Avoiding that scenario is why we implemented "string[pyarrow_numpy]" to have arrow storage with numpy semantics.

OK, all of that is what I had failed to communicate. What Will was saying when he said (paraphrasing) "we wouldn't need to implement another dtype", that didn't exclude a thin wrapper to retain numpy propagation semantics with pyarrow storage.

@WillAyd can you confirm that the above is a valid summary of what we got sorted out at the end of the call?

Member Author (@WillAyd, 2 May 2024):

Yea that's correct and a nice summary. This is why the BaseType in the PDEP has the missing_value_marker / na_value - it explicitly decouples the NA handling from the physical type. It specifically allows the logical type to cover the functionality of string[pyarrow_numpy] as you described above, and as we talked about on the call theoretically also allows you to have a logical date type which uses pyarrow behind the scenes but can present pd.NaT as the na_value to the end user

I think this PDEP could help achieve more consistent NA handling over time since it decouples the user expectations from a "backend", but actually propagating one NA marker consistently is left to a separate PDEP


An interesting question arises when a user constructs two logical types with differing physical types. If one is backed by NumPy and the other is backed by pyarrow, what should happen?
Member:

"constructs" here seems weird. do you mean when two objects with different storages interact? i.e. ser_pyarrow + ser_numpy? If so, I agree with the ordering below, and note that it can be governed by __pandas_priority__.

Member:

I agree with the ordering as well, but I would be supportive of just raising as well


This PDEP proposes that backends be prioritized in the following order (1 being the highest priority):

1. Arrow
2. pandas
Member:

maybe call this "masked"? calling it "pandas" seems to preference it as being canonical

Member Author:

Does pandas extension make more sense? Masked is maybe too specific, since this also refers to things like dictionary-encoded / categorical

3. NumPy

One reason for this is that Arrow generally provides the most efficient and least lossy physical representations. An obvious example arises when a pyarrow int64 array with missing data gets added to a NumPy int64 array; casting to the latter would lose data. Another reason is that Arrow represents the fastest growing ecosystem of tooling, and the PDEP author believes improving pandas's interoperability within that landscape is extremely important.

Aside from the backend, the C standard rules for [implicit conversion](https://en.cppreference.com/w/c/language/conversion) should apply to the data buffer, i.e. adding a pyarrow int8 array to a NumPy uint64 array should produce a pyarrow uint64 array.
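
A sketch of the intended resolution under these rules; the constructors are hypothetical and the comments describe the assumed outcome:

```python
import numpy as np
import pandas as pd

lhs = pd.Series([1, 2], dtype=pd.int8())  # hypothetical logical type, pyarrow-backed
rhs = pd.Series([3, 4], dtype=np.uint64)  # numpy-backed

# backend: Arrow outranks NumPy; buffer: int8 + uint64 promotes to uint64
(lhs + rhs).dtype  # pd.uint64(), physically a pyarrow uint64 array
```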

For more expensive conversions, pandas retains the right to throw warnings or even error out when two arrays of the same logical type but differing physical types are combined. For example, attempting string concatenation of string arrays backed by pyarrow and Python objects may throw a ``PerformanceWarning``, or maybe even a ``MemoryError`` if such a conversion exhausts the available system resources.

The ``BaseType`` proposed above also has a property for the ``missing_value_marker``. Operations that use two logical types with different missing value markers should raise, as there is no clear way to prioritize between the various sentinels.
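
For instance, with the hypothetical constructors used throughout this document:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(["a", None], dtype=pd.string(missing_value_marker=pd.NA))
s2 = pd.Series(["b", None], dtype=pd.string(missing_value_marker=np.nan))

s1 + s2  # raises: no principled way to choose between pd.NA and np.nan
```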

## PDEP-13 History
Contributor:

Suggested change: "## PDEP-11 History" → "## PDEP-13 History"


- 27 April 2024: Initial version