Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-13: The pandas Logical Type System #58455
base: main
Are you sure you want to change the base?
PDEP-13: The pandas Logical Type System #58455
Changes from 2 commits
b2eb5e6
8a24697
e7e6eb7
cc427ec
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #24662
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC we already support all of these, just not with all three of the storages. Is this paragraph suggesting that we implement all of these for all three (four if we include sparse) storages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all - this is just suggesting that these be the logical types we offer, whether through the factory functions or class-based design that follows.
In today's world it is the responsibility of the user to know what may or may not be offered by the various backends and pick from them accordingly. With dates/datetimes being a good example, a user would have to explicitly pick from some combination of
pd.ArrowDtype(pa.date32())
,pd.ArrowDtype(pa.timestamp(<unit>))
anddatetime64[unit]
since we don't offer any logical abstraction.In the model of this PDEP, we would offer something to the effect
pd.date()
andpd.datetime()
, abstracting away the type implementation details. This would be helpful for us as developers to solve things like #58315There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this correspond to pyarrow's decimal128?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either decimal128 or decimal256
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course those names don't have to match exactly with the eventual APIs, but I would personally just call this "Decimal" (which maps to both Python and Arrow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you call it "Decimal", is there possible confusion with the python standard library
decimal
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm even if there is I'm not sure that would matter. Do you think its that different today from the distinction between a python built-in
int
versus a "int64"?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, because the python
decimal
type lets you specify any number of digits for the precision, as well as rounding.So I don't know what the correspondence between pyarrow's decimal128 and decimal256 is. Just concerned that if you name this "decimal", there is potential confusion for users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea the arrow types also have a user-configurable precision / scale. The decimal128 variant offers up to 38 units of precision and 256 goes up to 76.
Taking a step back from Python/Arrow I think the problem you are hinting at is not new. Databases like postgres also offer Decimal types with different internal representations. So there might be some translation / edge cases to handle but generally I think calling it Decimal as a logical type signals the intent of the type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One can argue the same for "integer" which for Python is of unlimited size, while our ints are bound by the max of int64. In the case of the
decimal
stdlib module, there a few more differences, but regardless from that, if we would add a decimal / fixed point dtype to pandas in the future, I would personally still call it "decimal" (also because of python using that name, regardless of some subtle differences)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Categorical
instead? Because "dictionary" in Arrow terms is rather the physical type, not logical (like the string column could be dict encoded to save space, but still represent strings). While categorical (like pandas does it now) has specific semantics (i.e. fixed set of allowed values)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong preference, but
Dictionary
has the advantage of being more generic. WithCategorical
as a logical type I still think we'd end up offering a separateDictionary
logical type, which may or may not be a bad thingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the point of a Dictionary logical type in addition to Categorical? When would one use that?
Suppose for the case of having dict encoded string values, either you just want to see those as strings, and then this is the String logical type, or either they define a strict set of categories and then this is the Categorical logical type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a logical Dictionary type could offer support for things like Python dictionaries or JSON structures that would seem an odd fit through the lens of a Categorical type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, we are talking about completely different things then ;) I was assuming you were using "dictionary" in the Arrow sense, but so you are using it in the python sense (in Arrow that is called a
map
type).I am fine with in the future having a map/dictionary type (I also don't think this list necessarily needs to be exhaustive, we can also focus on those types for which we have somewhat support nowadays).
But then I do think that "Categorical" is missing in the list above (and now using the term as our existing CategoricalDtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK makes sense. Yea I agree let's add Categorical separately then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW i also thought this referred to a Categorical-like dtype.
there is a JSONDtype in the extension tests (im guessing you implemented it long ago) that IIRC has a ton of xfails. id suggest getting that closer to fully-working before committing ourselves to first-class support for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear I don't think we want to offer a first class logical JSON type, but a user can construct that from a combination of the list / dictionary types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote, I think in other systems JSON is typically physically stored as a string (so that's a bit different from what the current test JSONDtype does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this then get mixed into ExtensionDtype? if not, what is this used for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it depends on whether we as a team want to go down the path of offering constructors like
pd.string()
orpd.StringDtype()
; with the former approach this would not mix into the ExtensionDtype, since that would be closer to a physical type. If we went with the latter approach, we would be repurposing / clarifying the scope of the pandas extension type system to be totally logicalThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I think through it more it may make sense for these to be separate; the existing ExtensionDtype refers to a structure that pandas wants to contain a particular set of values. For extension types it is possible for there not to be a logical type representation in pandas
Maybe this should be renamed to
BaseLogicalType
to clairfyThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then im unclear on what we're doing with it? and if EA authors are supposed to implement something for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point - for third party EAs we would not be able to offer a logical type, since we don't know what the EA is. Let's see what other teammates think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's where my head is at - essentially they would see / interact with a subclass of the
BaseType
shown hereYea this is very tricky. I don't know what the best answer is but I don't really want to muddy the waters between an Extension type and a Logical type; I think these are two totally different things. But this PDEP would need to clarify further how users would use Extension types knowing there is no equivalent logical type for them (or maybe we need to have them register themselves as a logical type?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't fully follow why they are totally different things. I know our own types are messy (with the mixture of numpy, numpy-based extension dtypes and pyarrow extension dtypes), but for the external extension dtypes that I know, those are almost all to be considered as "logical" types (eg geometries, tensors, ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because our third party extension arrays give EA authors theoretical control over data buffers, masks, and algorithms. A logical type should have none of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's definitely true that in that sense an EA is also a physical type. But at the same time, in the idea that a user generally specifies the logical type, all external EAs are in practice also a logical type.
So I understand that those are different concepts, but I think in practice the external EA serves both purposes, and so what I don't fully see is 1) how it would be possible for external EAs to distinguish those concepts, and 2) what the value would be of doing so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your point 1) would definitely need to be developed further. Off the top of my head maybe an extension array would need to define an attribute like
logical_type
going forward. If the logical type if "foo", we would let that be constructed usingpd.foo
orpd.FooDtype
I think the value to 2) is that we could in theory also allow a third party EA to make its way into our logical type system. There are no shortage of string implementations that can be surmised, maybe there is some value in having an EA register yet another "string" logical type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there use cases for this other than the pyarrow strings? i.e. is it really needed for the general case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Other areas where I think it would matter would be:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, thanks. is there a reason to include the decimalN case but not (int|float|uint)N? is that distinct from itemsize?
its still unclear to me why this is needed in the general case. the way it is presented here suggests this is something i access on my dtype object, so in the DatetimeTZDtype case i would check dtype.physical_type.tz instead of dtype.tz? or i could imagine standardizing the dtype constructors so you would write
dtype = pd.datetime(physical_type=something_that_specifices_tz)
instead ofdtype=pd.datetime(tz=tz)
. Neither of these seems great, so i feel like I'm missing something.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good question about the decimal case. If coming from a database world its not uncommon to have to pick between different integer sizes, but I think rare to pick between different decimal sizes. With decimals you would usually just select your precision, from which we could dispatch to a 128 or 256 bit implementation...but yea I'm not sure the best way to approach that
Doing something like
pd.datetime(physical_type=something_that_specifies_tz)
is definitely not what this PDEP suggests users should do. This PDEP suggests the 99% use case would bepd.datetime(tz=tz)
and a user should not care if we use pandas or pyarrow underneath that.Using #58315 as an example again, if we used the
ser.dt.date
accessor on a series with a logical type ofpd.datetime()
, then the user should expect back apd.date()
back. That gives us a lot of flexibility as library authors - maybe we still wantpd.datetime()
to be backed by our datetime extension by default, but could then use pyarrow's date implementation to backpd.date()
by default rather than having to build that from scratchIn today's world we are either asking users to explicitly use
pa.timestamp()
to get back a true date type fromser.dt.date
, or they use the traditional pandas extension type and get backdtype=object
when they use the accessorThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems reasonable, but then im not clear on why physical_type is in the base class.
Doesn't this then give the user mixed-and-matched propagation behavior that we agreed was unwanted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the
physical_type
keyword or something else? Every logical type instance should have a physical type underneath. So the "string" type could be physically implemented as apa.string()
,pa.large_string()
,object
, etc...The "datetime" type could be physically implemented as apa.timestamp
or the existingpandas extension datetime
. Right now a decimal would only be backed by either apa.decimal128
orpa.decimal256
In the logical type world that is totally fine. It is up to our implementation to manage the fact that a datetime can be implemented either as a
pa.timestamp
or our own datetime type, and a date can only be implemented as apa.date32
, but that is not a concern of the end user. I am definitely not suggesting we develop a non-pyarrow date typeThe feedback is very helpful and its good to flesh this conversation out so no worries. Much appreciated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I certainly understand the idea of a physical type, I am not sure such a simple attribute will fit for all cases. For example for our own masked Int64Dtype, what's the physical type?
np.int64
? (but that's only for the data array, while there is also a mask array)And is this physical type then always either a numpy type or pyarrow type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would consider the physical type in that case to be the
pd.core.arrays.integer.Int64Dtype
but maybe we should add terminology to this PDEP?In the scope of this document a logical type has no control over the layout of a data buffer or mask; it simply serves as a conceptual placeholder for the idea of a type. The physical type, by contrast, does have at least some control how its data buffer and mask are laid out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "more clarity" request was for what we do with BaseDtype. I guess examples for both usage and construction with, say,
pd.DatetimeTZDtype(unit="ns", tz="US/Pacific")
would also be helpful.You're saying mixed semantics are fine here, but in #58315 you agreed we should avoid that. What changed? Having consistent/predictable semantics is definitely a concern of the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be as simple as:
Or
I think interoperating between pyarrow / numpy / our own extension types behind the scenes is perfectly fine. It would be problematic to have those implementation details spill out into the end user space though.
When a user says "I want a datetime type" they should just get something that works like a datetime type. When they then say
ser.dt.date()
they should get back something that works like a date. Having logical types works as a contract with the end user to promise we will deliver that. Whether we use NumPy, pyarrow, or something else underneath to do that is no concern of the end userThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in most places we call this "na_value". Is there a reason to use a different name here? is
pd.NA|np.nan
here just for the example's sake or is that a restriction you want to impose? What about None? Could a 3rd party EA author do something else?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - happy to change the name. I would rather not expand the scope beyond
pd.NA
ornp.nan
. I think that if we were starting over from scratch we wouldn't even have this field and just have one universally consistent NA marker that was not configurable. With that said, given the history of development I feel like we will have to continue to support both of these for the immediate future.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the main comments I was going to give: IMO this should not be configurable, even though it is true we live currently in this non-ideal world of multiple missing sentinels.
And it might makes things also more complex / dependent on other discussions, but I think for designing those new logical types we should somewhat assume we will in parallel also do a PDEP on converging on a single NA value for missing sentinel (it will actually makes things simpler for the design in this PDEP, it's just a bit a mess to discuss / implement all that on the short term because of the interdependencies).
Even if on the short term we have logical types that have either NA or NaN or NaT, I don't think that it should be something that the user can specify like in your example below (
pd.string(missing_value_marker=np.nan)
). I think generally the missing value sentinel will depend on the physical type or "backend" or however to call it (like if you choose a pyarrow type, then that determines the missing value marker). So we can still have this attribute (i.e. the currentna_value
already does this), but it's just the query the information, not to set by the user.If we want to move to those new logical types, we will anyway have a period where we have both the current types and the new types. So if we in parallel decide on a NA PDEP, it's also an option that those new logical types just all use NA. Whether that is feasible of course depends on the timeline of both developments (like ideally the string logical dtype for 3.0 would also already have used NA, but it does not because otherwise we could not yet introduce it by default right now)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this comment, but we have already implemented this in a way where we disconnect the "backend" from the NA marker with types like "pyarrow_numpy". I agree users ideally won't be touching this field (maybe it can be a read-only property?).
If we removed this altogether I think we would lose the ability to support a transition period for dtypes like
pyarrow_numpy
and thepython_numpy
string type you are working on, but I may be missing the bigger picture on that.I would definitely support another PDEP to unify our missing value handling, but would be wary of trying to tackle two large PDEPs at once. With this field being a part of the logical type proposal, I think it helps us support all of the existing types we have already implemented and give us better flexibility to control / deprecate the various NA markers we allow today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're certainly right that allowing both makes a transition easier, as is what we are doing with the string dtype. And depending on the timelines of both changes (logical types / NA), it might indeed be needed for other dtypes as well.
So I think my main point is that we shouldn't show this as a "parameter" of the dtype construction, but just as a property of a certain dtype instance.
Essentially what we did for the string dtype was to call this a different "backend" (although multiple string dtype instances with different null value might both be using pyarrow or numpy, but the "backend" argument maps to the concrete EA Array class that is being used, and those are actually different), so it is tied to a specific implementation, and not user-specified. And although this is very ugly with the different strange backend names, I think this is a more future-proof option if we think longer term the NA value should not be user configurable.
It might be nicer / clearer to show this in the StringDtype repr as a separate property instead of being merged in the backend name (so eg
StringDtype(storage="python", na_value=NA)
/StringDtype(storage="python", na_value=NaN)
instead ofStringDtype(storage="python")
/StringDtype(storage="python_numpy")
(because what does "python_numpy" mean to a user.. ?).But at the same time I also wouldn't want to embed too much the idea that
na_value
is user configurable, it's just that for the transition period we need a way to be able to specify both versions of the dtype.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that
na_value
is not for users of the pandas library, but for developers of the pandas library. I think it should be read-only, and possibly a "hidden" property (prefix with an underscore)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely support both of what @jorisvandenbossche and @Dr-Irv are saying here. I think where I am getting tripped up is how we allow users today to swap over to this paradigm.
If someone had a code base where they do
ser = pd.Series(..., dtype="string[pyarrow_numpy]")
and elsewhere maybe justser = pd.Series(..., dtype="string[pyarrow]")
, how do we think they can port to this new design? Or do we not think that is valuable?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is a dtype, I'd prefer having
Dtype
in the name. That favors the class based approach. But the factory function approach could be used as well, but I'd preferpd.stringDtype()
overpd.string()
to make it clear that it's a dtype.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool - I think @jorisvandenbossche has the same preference. So happy to start focus on that pattern more.
The only concern I have is that we are repurposing something that has long existed.
pd.Int64Dtype()
gets a new meaning and may no longer return the pandas extension type we created...but maybe that isn't a bad thingThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont have a strong preference, but do think it will make the discussion easier to focus on just one variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright I'll stick with that in the next revision. We can always revisit if that doesn't work out for whatever reason. Thanks for the feedback all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, my personal preference would be to go with the shorter functional constructors (
pd.string()
,pd.int64()
, etc).It's easier to use, less boilerplate, generally more similar to other libraries (this was my impression, but looking up a few examples, that is not necessarily true), and ties less into the exact Dtype class (in case two variants of one logical type want to use a different dtype class, although I am not sure this is necessarily something we want, though)
But agreed that for now, I think you can just use one throughout the document to explain everything, and then I would add one section on this topic mentioning some options and why which choice was made (and we can still further bikeshed on the exact syntax, but that feels like an actual bikeshed separate from the core ideas of the PDEP)
(@Dr-Irv if you want "dtype" in the name, yet another alternative would be to put the functions in a namespace, like
pd.dtypes.string()
. I seem to remember this also came up when I discussed topic in Basel last year)Well, we still need to discuss what those logical type constructors would actually return ;) In my mind, it could continue to return our extension types (as I wouldn't necessarily distinguish user-facing extension types and logical types), and then
pd.Int64Dtype()
would not necessarily have to change return value .. (except that it might gain additional keyword to e.g. get the pyarrow-backed integer dtype)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works for me. I'm not a big fan of continually adding things to the top level
pandas
namespace.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
off the top of my head, this would mean deprecating M8[unit], m8[unit], timedelta64[unit], period[freq], interval, interval[subtype, closed], category. (the first three of which are supported by
np.dtype(...)
). Just checking that these are all intentional.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely would want to deprecate using the NumPy types since those would be circumventing the logical type system. I am -0.5 on continuing to allow timedelta64 because that also refers directly to the extension array implementation name rather than a logical type.
Period seems logical to continue supporting. I don't think I've ever constructed an interval that way before so I'll withhold an opinion on that for now, but generally these are offered for legacy support and not the ideal way of creating types going forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also some of the discussion in #58243.
But IMO it's a very hard step to actually deprecate specifying numpy dtypes, given how widespread this usage is. And certainly for those types where there is an unambiguous mapping to our new logical types (like np.int64), I am not fully sure what the benefit is about deprecating that.
In any case, if we would deprecate that, it would need to go very slowly, and we should only do that a while after introducing the new pandas types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit would be having some consistency in the type constructors we allow and it would decouple the user focus on something have to be NumPy.
I agree it would be a very long deprecation, but on the flip side I don't think a user really believes they are subscribing into any particular set of features when they ask for a
np.int64
data type. They really just want a 64 bit integer.Especially with integer types being zero copy there is very little benefit to asking for a
np.int64
, and I would argue that is even counterproductive with the possibility of missing values being introducedThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm leaning towards thinking deprecating these would cause a lot of churn for users without much clear benefit. To the extent that you want to encourage users to use different patterns, I'm fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing users to construct physical types like
np.int64
alongside logical types likepd.int64
I think defeats the purpose of adding logical types though; in such a case we are just adding more type construction options, which is a growing problem in our current state that this PDEP is trying to solveThere also is a clear benefit to
pd.int64
overnp.int64
, namely the former has flexibility to better handle missing values whereas the latter locks you into numpy semanticsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my mind, we wouldn't allow them to create a Series with a "physical type", we would just accept
np.int64
as input for thedtype
argument, and specifyingnp.int64
orpd.int64()
would give the exact same result: a Series with whatever is the default int64 dtype at that point (right now that is the numpy based int64, in the future maybe a numpy masked int64 or pyarrow int64).It's certainly true that this doesn't help with the "too many ways" to specify a dtype, and this PDEP would add yet another way, but at least a way that is consistent across all dtypes, and which ensures there is a clear mapping of whathever random input to the resulting dtype.
For example,
np.int64
is actually not even a dtype but a scalar constructor (np.dtype("int64")
is the dtype), but conveniently accepted as a dtype argument in numpy. Just like the python types are also accepted, in addition to strings.So with the logical types, you can say that (for now) all of
np.int64
,np.dtype("int64")
,"int64"
,int
,"int"
all map to the same aspd.int64()
, and longer term we could discuss reducing the number of aliases being accepted.(I didn't do that with this comment .. ;), but personally I think we should maybe better focus first on the actual (logical) dtypes (the new part in this PDEP), and we can still later consider which aliases should be allowed or not. Short term we need to keep them all working anyway)
EDIT: numpy has very extensive documentation about which values are exactly accepted as dtype descriptors/aliases at https://numpy.org/devdocs/reference/arrays.dtypes.html#specifying-and-constructing-data-types, we should probably do something similar for pandas at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's fair. Providing compat is going to be a fact of life for this. Ultimately I don't think it would be that hard for a user to swap out
np.int64
for the logical constructor (in such an instance it would just be find/replace), but yea I agree that can be something cleaned up over time. Its not critical to execution of this PDEPThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to address if people can convert between two physical types that back the same logical type?
below, you discuss where operations between 2 objects of different physical types, but same logical type, cause a priority in the result. But what if users want to convert from one physical type to another?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great callout - I assume you have in mind something like adding a logical int + float.
I think the same rules should still apply, i.e. if one is physically a pyarrow int and the other a numpy float you should end up with logical float type backed by a physical pyarrow float
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or if I have a numpy backed string and I want to convert to a pyarrow backed string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those would both be under the logical string type, so if you tried to concatenate them together you would get a pyarrow string per the rules outlined here. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm saying that let's say you know that you have a
Series
of logical typestring
, and you also know that it is backed bynumpy
string, and you want to convert thatSeries
to have a physical type ofarrow
. Don't you need a public API that lets you convert from one physical backing to another?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea if we move to a place where the
DatetimeIndex
is backed by the logical datetime type, then it becomes an implementation detail for us to control how the two different physical types need to fulfill the requirements of a logical datetime type. I think since the pyarrow / NumPy datetime types are zero-copy save for NaT handling the internal development for that should be a lot less burdensome than having a separateDatetimeIndex[pyarrow]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - there would be no need at all for that. In this particular case, if the logical requirement was to go from a datetime -> date, and physically our array happened to be backed by a NumPy array, we internally would go NumPy datetime -> pa.timestamp -> pa.date, with the pa.date being exposed as a
pd.date
type; if the array was already backed by pyarrow we could skip the first stepI think even if PDEP-10 got reverted it would limit the logical types we offer (we'd essentially scrap date, list, dictionary, struct, etc...anything where only pyarrow implements something) until pyarrow becomes a hard dependency again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're definitely talking past each other, because AFAICT the suggestion here directly contradicts discussion in #58315. Let's put a pin in this until the call on Wednesday and see if we can sort it out then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Summarizing this part of the discussion from the call: the communication failure was in 1) what "mixed semantics" I was concerned about and 2) what Will meant by not needing to implement anything new.
Consider
Now suppose that df["date"] is backed (perhaps indirectly) by a pyarrow array. When I say the columns have "mixed semantics", I am referring to the propagation of nulls in comparisons
In this scenario
mask1
andmask2
are both nullable and their second entry is pd.NA. Using these masks for indexing gives surprising results (raising ATM, different if the Ice Cream Agreement goes into effect) to users expecting numpy semantics. A user who has not opted in to null propagation has some columns that behave this way and some that do not. Avoiding that scenario is why we implemented "string[pyarrow_numpy]" to have arrow storage with numpy semantics.OK, all of that is what I had failed to communicate. What Will was saying when he said (paraphrasing) "we wouldn't need to implement another dtype", that didn't exclude a thin wrapper to retain numpy propagation semantics with pyarrow storage.
@WillAyd can you confirm that the above is a valid summary of what we got sorted out at the end of the call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's correct and a nice summary. This is why the BaseType in the PDEP has the
missing_value_marker
/na_value
- it explicitly decouples the NA handling from the physical type. It specifically allows the logical type to cover the functionality ofstring[pyarrow_numpy]
as you described above, and as we talked about on the call theoretically also allows you to have a logical date type which uses pyarrow behind the scenes but can presentpd.NaT
as thena_value
to the end userI think this PDEP could help achieve more consistent NA handling over time since it decouples the user expectations from a "backend", but actually propogating one NA marker consistently is left to a separate PDEP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"constructs" here seems weird. do you mean when two objects with different storages interact? i.e. ser_pyarrow + ser_numpy? If so, I agree with the ordering below, and note that it can be governed by
__pandas_priority__
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the ordering as well, but I would be supportive of just raising as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe call this "masked"? calling it "pandas" seems to preference it as being canonical
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does
pandas extension
make more sense? Masked is maybe too specific, since this also refers to things like dictionary-encoded / categoricalThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.