PDEP-13: The pandas Logical Type System #58455
xref #24662
IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?
Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.
In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of `pd.ArrowDtype(pa.date32())`, `pd.ArrowDtype(pa.timestamp(<unit>))`, and `datetime64[unit]`, since we don't offer any logical abstraction.

In the model of this PDEP, we would offer something to the effect of `pd.date()` and `pd.datetime()`, abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315.
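To make the status quo concrete, here is a minimal sketch of the behavior being discussed: with the default NumPy-backed `datetime64` dtype, `ser.dt.date` has no true date type to return, so it falls back to `dtype=object`.

```python
import pandas as pd

# Default NumPy-backed datetimes: there is no logical "date" type,
# so .dt.date degrades to a Series of Python datetime.date objects.
ser = pd.Series(pd.date_range("2024-01-01", periods=3))
dates = ser.dt.date
print(dates.dtype)  # object
```

Under the logical-type model proposed here, the same accessor could instead hand back a proper `pd.date()` type regardless of the physical backend.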
does this correspond to pyarrow's decimal128?
Either decimal128 or decimal256
Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)
Sounds good to me
If you call it "Decimal", is there possible confusion with the Python standard library `decimal` module?
Hmm, even if there is I'm not sure that would matter. Do you think it's that different today from the distinction between a Python built-in `int` versus an `int64`?
Yes, because the Python `decimal` type lets you specify any number of digits for the precision, as well as rounding behavior.

So I don't know what the correspondence to pyarrow's decimal128 and decimal256 is. I'm just concerned that if you name this "decimal", there is potential confusion for users.
Yea, the Arrow types also have user-configurable precision / scale. The decimal128 variant offers up to 38 digits of precision and decimal256 goes up to 76.

Taking a step back from Python/Arrow, I think the problem you are hinting at is not new. Databases like Postgres also offer decimal types with different internal representations. So there might be some translation / edge cases to handle, but generally I think calling it "Decimal" as a logical type signals the intent of the type.
One can argue the same for "integer", which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the `decimal` stdlib module there are a few more differences, but regardless of that, if we would add a decimal / fixed-point dtype to pandas in the future, I would personally still call it "decimal" (also because Python uses that name, regardless of some subtle differences).
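For reference, a stdlib-only sketch of the difference being discussed: Python's `decimal` module makes precision a context setting the user can raise arbitrarily, whereas Arrow's decimal128/decimal256 cap precision at 38 and 76 digits respectively.

```python
from decimal import Decimal, getcontext

# Precision in the stdlib is a context setting, not a storage limit.
# 50 significant digits exceeds what a decimal128 (38 digits) can hold.
getcontext().prec = 50
result = Decimal(1) / Decimal(7)
print(result)  # 0.142857... carried out to 50 significant digits
```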
would this then get mixed into ExtensionDtype? if not, what is this used for?
I suppose it depends on whether we as a team want to go down the path of offering constructors like `pd.string()` or `pd.StringDtype()`; with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logical.
As I think through it more, it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas.

Maybe this should be renamed to `BaseLogicalType` to clarify.
Then I'm unclear on what we're doing with it, and whether EA authors are supposed to implement something for it.
That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think
Yea, that's where my head is at - essentially they would see / interact with a subclass of the `BaseType` shown here.

This is very tricky. I don't know what the best answer is, but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?).
I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)
Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these
That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.
So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.
Your point 1) would definitely need to be developed further. Off the top of my head, maybe an extension array would need to define an attribute like `logical_type` going forward. If the logical type is "foo", we would let that be constructed using `pd.foo` or `pd.FooDtype`.

I think the value of 2) is that we could in theory also allow a third-party EA to make its way into our logical type system. There is no shortage of string implementations that can be surmised; maybe there is some value in having an EA register yet another "string" logical type?
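A purely hypothetical sketch of what such a registration hook could look like. None of these names (`register_logical_type`, `logical_type`, `MyStringArray`) exist in pandas; this only illustrates the idea of an EA author declaring which logical type their array implements.

```python
# Hypothetical registry mapping a logical type name to the physical
# (extension array) implementations that back it. Not a real pandas API.
_LOGICAL_REGISTRY: dict[str, list[type]] = {}

def register_logical_type(name: str, ea_cls: type) -> None:
    """Register an extension array class as one implementation of `name`."""
    _LOGICAL_REGISTRY.setdefault(name, []).append(ea_cls)

class MyStringArray:
    # A third-party EA would declare the logical type it implements,
    # letting pandas surface it through pd.string() / pd.StringDtype().
    logical_type = "string"

register_logical_type(MyStringArray.logical_type, MyStringArray)
print(_LOGICAL_REGISTRY["string"])
```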
are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?
Good question. Other areas where I think it would matter would be:
makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?
It's still unclear to me why this is needed in the general case. The way it is presented here suggests this is something I access on my dtype object, so in the DatetimeTZDtype case I would check `dtype.physical_type.tz` instead of `dtype.tz`? Or I could imagine standardizing the dtype constructors so you would write `dtype = pd.datetime(physical_type=something_that_specifies_tz)` instead of `dtype = pd.datetime(tz=tz)`. Neither of these seems great, so I feel like I'm missing something.
That's a good question about the decimal case. Coming from a database world, it's not uncommon to have to pick between different integer sizes, but I think rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128- or 256-bit implementation... but yea, I'm not sure of the best way to approach that.

Doing something like `pd.datetime(physical_type=something_that_specifies_tz)` is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would be `pd.datetime(tz=tz)`, and a user should not care if we use pandas or pyarrow underneath that.

Using #58315 as an example again, if we used the `ser.dt.date` accessor on a series with a logical type of `pd.datetime()`, then the user should expect back a `pd.date()`. That gives us a lot of flexibility as library authors - maybe we still want `pd.datetime()` to be backed by our datetime extension by default, but could then use pyarrow's date implementation to back `pd.date()` by default rather than having to build that from scratch.

In today's world we are either asking users to explicitly use `pa.timestamp()` to get back a true date type from `ser.dt.date`, or they use the traditional pandas extension type and get back `dtype=object` when they use the accessor.
This seems reasonable, but then I'm not clear on why `physical_type` is in the base class.

Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?
Are you referring to the `physical_type` keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as a `pa.string()`, `pa.large_string()`, `object`, etc. The "datetime" type could be physically implemented as a `pa.timestamp` or the existing pandas extension datetime. Right now a decimal would only be backed by either a `pa.decimal128` or `pa.decimal256`.

In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a `pa.timestamp` or our own datetime type, and a date can only be implemented as a `pa.date32`, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date type.

The feedback is very helpful and it's good to flesh this conversation out, so no worries. Much appreciated.
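A hypothetical sketch of the relationship being described, where the logical type is only a name and an intent, and the physical backing is a swappable detail. The class and attribute names here are invented for illustration, not proposed API:

```python
from dataclasses import dataclass

@dataclass
class LogicalString:
    """A logical 'string' type: an intent, with no buffer/mask layout."""

    # The physical backing is an implementation detail; pandas, not the
    # user, would choose among e.g. "object", "string", "large_string".
    physical_type: str = "large_string"

    def __str__(self) -> str:
        # Users only ever see the logical name, whatever the backing.
        return "string"

dtype = LogicalString(physical_type="object")
print(str(dtype))  # "string" - the physical backing stays hidden
```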
While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit all cases. For example, for our own masked Int64Dtype, what's the physical type? `np.int64`? (But that's only for the data array, while there is also a mask array.)

And is this physical type then always either a numpy type or a pyarrow type?
I would consider the physical type in that case to be the `pd.core.arrays.integer.Int64Dtype`, but maybe we should add terminology to this PDEP?

In the scope of this document, a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control over how its data buffer and mask are laid out.
The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say, `pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")` would also be helpful.

You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.
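For reference, construction and usage of the existing (physical) dtype mentioned above looks like this in pandas today:

```python
import pandas as pd

# Construct the existing timezone-aware datetime dtype directly.
dtype = pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")

# A localized series carries this dtype.
ser = pd.Series(pd.to_datetime(["2024-01-01 12:00"]).tz_localize("US/Pacific"))
print(ser.dtype)           # datetime64[ns, US/Pacific]
print(ser.dtype == dtype)  # True
```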
I think it should be as simple as:
Or
I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.
When a user says "I want a datetime type", they should just get something that works like a datetime type. When they then say `ser.dt.date()`, they should get back something that works like a date. Having logical types works as a contract with the end user: a promise that we will deliver that. Whether we use NumPy, pyarrow, or something else underneath is no concern of the end user.
@jbrockmendel wanted to ping this off you. When we talked I said I would stick with np.nan as the missing value marker, but since we are reusing the `pd.StringDtype()` in this version of the proposal, I don't think we should move that away from `pd.NA`.

It's not the default, but maybe this `pd.na_marker="legacy"` serves as a good middle ground?
if there's a global option, it should go through pd.set_option/get_option
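For context, the existing options mechanism being referred to is a real pandas API; a module-level attribute like the hypothetical `pd.na_marker` above would instead be spelled roughly like this:

```python
import pandas as pd

# Global configuration in pandas goes through the options API.
pd.set_option("display.max_rows", 50)
print(pd.get_option("display.max_rows"))  # 50

# Options can also be scoped temporarily.
with pd.option_context("display.max_rows", 10):
    print(pd.get_option("display.max_rows"))  # 10
```

A hypothetical NA-marker option would presumably follow the same pattern (e.g. something like `pd.set_option("mode.na_marker", "legacy")`, name invented here for illustration).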
why data_manager instead of storage?
The only reason I wanted to include the word data is because we are strictly referring to the data buffer and nothing else. I don't have a strong preference on the terminology of this though
i feel like you're not engaging with what joris and i tried to explain about "you need to do it all at once" on Wednesday's call.
Sorry missed this before - what do you think is missing in this section?
This section dramatically expands the scope of this PDEP, to encompass a lot of what would be PDEP-NA. If you want to make PDEP-NA, great, go for it. Until then, this one should either a) be tightly scoped so as to not depend on PDEP-NA, or b) be explicitly "this is on the back-burner until after PDEP-NA". I prefer the first option, but you do you.
Is your objection just to the default of pd.NA, or to the more general approach of letting `na_marker` support different nullability semantics while still using the same logical type? I am indifferent if we felt the need to start with the "legacy" approach of mixed sentinels, but I think the latter use of `na_marker` is a critical piece to executing this PDEP.
Yes, that is a deal-breaker for me until we have a dedicated PDEP about it.
Do you think the `na_marker="legacy"` makes sense? Curious to know if you think the NA marker for logical types that have no NumPy equivalent (ex: ListDtype) should still be `np.nan` or `pd.NA`.
Bikeshedding naming aside, I'd be fine with that as long as it is the default until we do an all-at-once switchover following PDEP-NA.
Period, timedelta64
Sure, I can make this a fully exhaustive list. For the "legacy" missing value on `BinaryDtype`, `ListDtype`, `StructDtype`, and `MapDtype`, do you think `np.nan` or `pd.NA` makes more sense?