PDEP-15: Change status of PDEP-10 to rejected #58623

Open: 5 commits to merge into base: main

lithomas1
Member

Just something quick and dirty I threw together that we should discuss at tomorrow's dev call.

@lithomas1 added the PDEP pandas enhancement proposal label on May 7, 2024
@lithomas1
Member Author

/preview

Contributor

github-actions bot commented May 7, 2024

Website preview of this PR available at: https://pandas.pydata.org/preview/pandas-dev/pandas/58623/

@lithomas1
Member Author

Ugh, looks like the bullets aren't rendering correctly.

@WillAyd
Member

WillAyd commented May 7, 2024

Did I miss an official vote on rejecting this? I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback

@lithomas1
Member Author

Nope, just opening since I said I would in the discussion issue.

We'll still need a formal vote - I'm just kicking off the discussion here.


2) Many of the benefits presented in this PDEP can be materialized even with pyarrow as an optional dependency.

For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics
Member

PDEP-14 does not provide the performance or memory savings if you do not have pyarrow installed.

Member Author

Added a note in parentheses at the end of that sentence.

Member

Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

Member Author

I updated the PDEP-15 text, and forgot to remove the PDEP-10 changes.

I've removed the PDEP-10 changes now.

The primary reasons for rejecting this PDEP are twofold:

1) Requiring pyarrow as a dependency causes installation problems.
- Pyarrow does not fit or has a hard time fitting in space-constrained environments
Member

I think what we could learn from this process is what caused us to change our minds. These issues were discussed leading up to the acceptance of PDEP-10.

The way this is written, I think it reads more as "we discovered this after the fact" instead of "we decided that X amount of negative feedback on these points was enough to revert". I think there is some value to future PDEPs in setting expectations around the latter.

@lithomas1 marked this pull request as ready for review on May 8, 2024, 15:09
@lithomas1
Member Author

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

xref #57073 (comment) for context

Member

@simonjayhawkins left a comment

Thanks @lithomas1 for making the updates needed to formally reject PDEP-10.

web/pandas/pdeps/0010-required-pyarrow-dependency.md (outdated, resolved)
@Dr-Irv
Contributor

Dr-Irv commented May 8, 2024

As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep.

@simonjayhawkins
Member

"I am not sure yet that I would want to reject, and am still leaning towards keeping in spite of some negative feedback"

I'm now leaning towards approving the rejection. My approval of the original PDEP was based solely on improvements to default inference for other dtypes. Despite some recent comments about this, no discussion/clarification has followed on this topic. I'd need to see some positive evidence that the original PDEP-10 authors still intend to support delivering the promised enhancements in this area. Now that the implications of using pd.NA as a default have been discussed in more depth, I suspect that any improved inference would need a couple of dtype variants.

@lithomas1
Member Author

"As discussed in dev meeting on 5/8/24, suggestion is to do a new PDEP that reverts PDEP-10, and keeps any parts we want to keep."

Yep, I'm planning on updating this current PR to do that, so if anyone has any objections or whatever, we can still discuss here.

@WillAyd
Member

WillAyd commented May 18, 2024

Minor note - do we need to rename this PR? Right now PDEP-10 shows twice on the website

@lithomas1
Member Author

Yeah, I'll probably change the name to PDEP-15 once I get around to moving this to a separate PDEP (probably tomorrow).

I was travelling the past week, so didn't really have time then.

@lithomas1 changed the title from "PDEP-10: Change status to rejected" to "PDEP-15: Change status to rejected" on May 19, 2024
The primary reasons for rejecting this PDEP are twofold:

1) Requiring pyarrow as a dependency causes installation problems.
- Pyarrow does not fit or has a hard time fitting in space-constrained environments
Member

Within the context of recent conversation I don't think this comment about AWS is true. AWS distributes an official pandas image for lambda which already includes pyarrow, pandas, and NumPy. This is all required by their own "AWS SDK on pandas" library.

I think the issue, more finely scoped, is that the default wheel installation via pip into a Lambda image exceeds the 256 MB limit. Using either the official AWS-provided image or miniconda should not exceed the space limits.
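
One way to check this kind of claim for a given install is to measure the installed tree directly. A minimal sketch, assuming a hypothetical helper `tree_size_mb` and taking the 256 MB figure from the comment above as the threshold (the real limit for your platform should be confirmed in its documentation):

```python
import os


def tree_size_mb(root: str) -> float:
    """Total size of all regular files under root, in MB (1 MB = 1024**2 bytes)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted twice.
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total / 1024**2


# Assumption: threshold copied from the comment above, not from AWS documentation.
LIMIT_MB = 256


def fits(root: str, limit_mb: float = LIMIT_MB) -> bool:
    """Return True if the tree under root is within the size limit."""
    return tree_size_mb(root) <= limit_mb
```

For example, after `pip install --target ./pkg pandas pyarrow`, calling `tree_size_mb("./pkg")` reports what that particular install actually occupies.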


2) Many of the benefits presented in this PDEP can be materialized even with payrrow as an optional dependency.

For example, as detailed in PDEP-14, it is possible to create a new string data type with the same semantics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you push this up? I don't see anything in parentheses.

The way I am interpreting this now is "we don't need/care for pyarrow strings because we have always had a string data type using Python strings" - is that correct?

@Dr-Irv changed the title from "PDEP-15: Change status to rejected" to "PDEP-15: Change status of PDEP-10 to rejected" on May 20, 2024
Contributor

@Dr-Irv left a comment

My main comment is that PDEP-10 should be minimally modified, and that PDEP-15 has all the discussion about why we did the rejection.

web/pandas/pdeps/0010-required-pyarrow-dependency.md (outdated, resolved)
web/pandas/pdeps/0015-do-not-require-pyarrow.md (outdated, resolved; 4 review threads)
While both of these reasons are mentioned in the drawbacks section of this PDEP, at the time of the writing
of the PDEP, we underestimated the impact this would have on users, and also downstream developers.

2) Many of the benefits presented in this PDEP can be materialized even with pyarrow as an optional dependency.
Member

I personally don't find this point very convincing. Saying "Many of the benefits" but then following it up with one bullet point seems to miss the mark. What are the other benefits that we don't need pyarrow for? Without pyarrow, users are forgoing:

  • High performance string operations
  • Direct string creation from I/O routines (i.e. no intermediate copies)
  • Zero copy data exchange through Arrow C Data Interface
  • Performant, memory efficient, and consistent NA handling

On the larger roadmap of pandas this moves us away from tighter Arrow integration, which means we move further away from Arrow compute algorithms / joins and the larger ecosystem of tools that includes streaming, query optimizers, planners, data engines, etc...

I think this argument in its current form is saying "we don't need a car because we have a horse and buggy"

Member Author

In PDEP-10, there are 3 benefits listed

  1. pyarrow strings (possible to provide users this benefit without making pyarrow required)
  2. Nested datatypes (can't have this without arrow, but this is a bit niche)
  3. Interoperability (the alternative is the dataframe interchange protocol, which is more widely adopted at the moment. Not sure about the zero-copy stuff for that, though. I think it also might be possible to implement Arrow C Data Interface support without taking on a hard dep on pyarrow)
    • Also, the primary beneficiary of this is other dataframe libraries (as opposed to us).

So, IMO, this argument is accurate, in that most of the benefits in PDEP-10 can be made possible (for those users who have pyarrow installed) without making pyarrow required.

The future benefits of Arrow are very compelling, but decisions on making a dependency required should be based on immediate and not future benefits. Like I said before, it is easy to reconsider this decision in a year's time if those future benefits materialize.
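
To make the "optional dependency" point concrete, here is a minimal sketch of feature detection. The helper names `pyarrow_available` and `pick_string_storage` are hypothetical (not pandas APIs), and the returned storage names are illustrative assumptions: the code only checks whether pyarrow is importable and falls back to Python-object storage otherwise.

```python
import importlib.util


def pyarrow_available() -> bool:
    """Return True if pyarrow can be found, without actually importing it."""
    return importlib.util.find_spec("pyarrow") is not None


def pick_string_storage() -> str:
    """Hypothetical helper: choose a backing store for a string dtype."""
    # Prefer Arrow-backed strings when pyarrow is installed,
    # otherwise fall back to Python-object-backed strings.
    return "pyarrow" if pyarrow_available() else "python"


print(pick_string_storage())
```

Users with pyarrow installed would get the faster path, while everyone else keeps a working (slower) fallback, which is the trade-off being debated in this thread.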

Member

If you think points 1 and 3 are possible without pyarrow then the alternatives for that should be laid out in this PDEP, at least at a super high level. I'm assuming point 1 refers to the nanoarrow POC I was sharing; point 3 requires reimplementing the conversions that pyarrow already has. (I personally don't think building either of those from scratch is a good long term solution but it can at least be discussed)

For point 2, how do you know those are niche applications? It's easy to dismiss things that don't exist today as not worthwhile, but I get the feeling that there could be plenty of use cases for the aggregate types, since they have a natural fit with many of the Python containers.

On interoperability the long term prospects for the dataframe interchange protocol seem dubious, and we have even discussed moving that out of pandas (see #56732).

"Also, the primary beneficiary of this is other dataframe libraries (as opposed to us)."

The Arrow interchange protocol can be used by any library that needs to work with Arrow data; its use is not limited to other dataframe libraries. It provides a standardized API so that third parties don't need to hack into our internals, which is a direct benefit for us. It also works in both directions: we can be a consumer just as much as a producer.
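
As a rough illustration of the "standardized API" point: the Arrow PyCapsule interface is built on dunder methods such as `__arrow_c_schema__`, `__arrow_c_array__`, and `__arrow_c_stream__`, so a consumer can probe for them instead of reaching into a producer's internals. The sketch below uses a hypothetical helper `arrow_export_protocols` and a hypothetical stand-in class `FakeTable`; it does not construct real capsules.

```python
def arrow_export_protocols(obj) -> list[str]:
    """List which Arrow PyCapsule export methods an object exposes."""
    names = ("__arrow_c_schema__", "__arrow_c_array__", "__arrow_c_stream__")
    return [name for name in names if hasattr(obj, name)]


class FakeTable:
    """Hypothetical stand-in for a producer; a real one returns a PyCapsule."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")


print(arrow_export_protocols(FakeTable()))  # ['__arrow_c_stream__']
print(arrow_export_protocols(object()))     # []
```

The same check works whether pandas is the producer or the consumer, which is the two-directional benefit described above.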

"Nested datatypes (can't have this without arrow, but this is a bit niche)"

Also wanted to point out that arrow has a decimal128 and decimal256 type which is especially useful for financial calculations where floating point inaccuracies cannot be tolerated, and the arrow decimal types are an extremely significant improvement over using object.
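
The floating-point concern is easy to reproduce. The sketch below uses Python's standard-library `decimal` module as a stand-in, not Arrow's decimal128 itself; it only illustrates why exact decimal arithmetic matters for monetary values.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.1 or 0.2 exactly.
float_total = 0.1 + 0.2

# Decimal values built from strings keep exact base-10 digits.
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)               # False
print(decimal_total == Decimal("0.3"))  # True
```

Storing `Decimal` objects in an object-dtype column gives this exactness but loses vectorized performance, which is the gap the Arrow decimal types are said to close.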

Member Author

Sure, will update and add a note in the PDEP when I get time again.

lithomas1 and others added 2 commits May 20, 2024 22:30
Co-authored-by: Irv Lustig <irv@princeton.com>
Co-authored-by: Irv Lustig <irv@princeton.com>
Contributor

@Dr-Irv left a comment

I'm fine with this PDEP, although I'm unsure whether language such as "we, the core team" should appear in a PDEP.

5 participants