
[Task]: Update the minor version of google-cloud-storage library prior to Beam release. #27326

Closed
1 of 15 tasks
BjornPrime opened this issue Jun 30, 2023 · 18 comments
Labels: done & done (Issue has been reviewed after it was closed for verification, followups, etc.), P3, python, task

Comments

@BjornPrime
Contributor

BjornPrime commented Jun 30, 2023

What needs to happen?

The current implementation of GCSIO uses an internal field, google.cloud.storage.batch.Batch._responses, from the GCS client. This field is unlikely to change, but as an internal it carries no compatibility guarantee, so the dependency has an upper version bound to avoid breakages. Please check whether a new version of the GCS client has been released and, if it is compatible, increment the version bound in setup.py.

We can consider closing this issue when any of the following conditions is met:

  • Beam vendors the GCS client.
  • The GCS client is updated to support our use case without referencing internal members.
  • GCSIO is switched to relying on another client.

Until then, don't close this issue; instead, move it to the next release milestone after updating the version in https://github.com/apache/beam/blob/master/sdks/python/setup.py
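For reference, the pin is a single requirement string in that file. A rough sketch of what the bump looks like (the version numbers below are placeholders, not the actual bounds in setup.py):

# sdks/python/setup.py (excerpt; version numbers are illustrative only)
REQUIRED_PACKAGES = [
    # ...
    # Keep a tight upper bound: gcsio reads the internal
    # google.cloud.storage.batch.Batch._responses field, so raise the bound
    # only after confirming a new client release still exposes that field.
    'google-cloud-storage>=2.14.0,<2.15.0',
    # ...
]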

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@lostluck
Contributor

lostluck commented Aug 2, 2023

2.50 release manager here.
This issue is currently tagged for the 2.50.0 release, which cuts in a week on August 9th.

Please complete work and get it into the main branch in that time, or move this issue to the 2.51 Milestone: https://github.com/apache/beam/milestone/15

@tvalentyn
Contributor

tvalentyn commented Aug 10, 2023

Are there any plans to expose batch._responses in a public API?

@tvalentyn
Contributor

@BjornPrime @cojenco did you by chance already discuss that Apache Beam has a dependency on Batch._responses? Would it be possible to include it in a public API? It would be good to avoid the risk of accidental breakages and the overhead of revisiting this on every minor version update if we can.

@kennknowles
Member

2.51.0 release manager here. The release branch has already been cut. I might accept a cherry-pick of this if it is vital, but dependency upgrades are always high risk for making a release unusable.

@BjornPrime
Contributor Author

BjornPrime commented Sep 21, 2023

Pushing the version ceiling to 2.11 will be needed if everything goes well with the GCS client migration (#28079) and we decide to cherry-pick that into the release. Otherwise, it can wait until the next release.

@tvalentyn
Contributor

@BjornPrime do you have an update on:

@BjornPrime @cojenco did you by chance already discuss that Apache Beam has a dependency on Batch._responses?

@BjornPrime
Contributor Author

I discussed that with my contact on the GCS team and they didn't seem to consider it a priority, but I could push harder on it. It would make things easier for us and seems like an easy change (though I don't want to speak with certainty on that).

@lostluck
Contributor

This has been punted for a few versions without any work AFAICT. I'm going to remove it from the release milestone. It can be re-added to another milestone if it's deemed release blocking (which feels unlikely as a P3).

@lostluck lostluck removed this from the 2.54.0 Release milestone Jan 17, 2024
@shunping
Contributor

.take-issue

@shunping
Contributor

shunping commented Jan 19, 2024

I just checked the latest gcsio (after the recent migration); we still rely on the internal variable _responses mentioned in the description.

I believe we did that because we need to check the responses from the batch request, but the GCS client library does not provide a public way to return this information.
https://github.com/googleapis/python-storage/blob/v2.14.0/google/cloud/storage/batch.py#L145
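For context, this is roughly the pattern (a minimal sketch, not the exact Beam code; the bucket and blob names are made up):

# Sketch: after a batch runs, the per-request results are only reachable
# through the private Batch._responses attribute.
from google.cloud import storage

client = storage.Client()
batch = client.batch(raise_exception=False)
with batch:
    for blob_name in ['a.txt', 'b.txt']:  # illustrative names
        client.bucket('my-bucket').delete_blob(blob_name)
# __exit__() has called finish(); inspect each deferred request's outcome.
statuses = [response.status_code for response in batch._responses]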

@shunping
Contributor

shunping commented Jan 22, 2024

I talked to the team in charge of the google-cloud-storage Python library. They mentioned that they have put all feature requests on the batch module on hold, since a new API for batch operations is under development in GCS.

This is both good news and bad news for us. On one hand, it means that for a little while we won't see any changes in this module, so the way we reference the private member variable will keep working. On the other, when the new API is ready, it is very likely to introduce breaking changes.

I have asked them to let us know when that happens, but at the moment we don't have any action items.

Note that an alternative to referencing the private member variable is a hack along the following lines: call finish() ourselves inside the with block, then raise an exception so that __exit__() skips its own call to finish(). (The sentinel exception class below is just for illustration.)

responses = None
current_batch = self.client.batch(raise_exception=False)

class _BatchFinished(Exception):
    pass  # sentinel used only to skip the duplicate finish() in __exit__()

try:
    with current_batch:
        # queue up the batch operations here
        # ...
        # call finish() ourselves so we can capture the responses
        responses = current_batch.finish(raise_exception=False)
        # raise so that __exit__() sees an exception and skips finish()
        raise _BatchFinished()
except _BatchFinished:
    pass

I don't think we will go with this hack given the upcoming changes to the batch module.

@liferoad
Collaborator

No need to update the library for now. Close this.

@shunping
Contributor

shunping commented Jan 23, 2024

I am fine with closing this as there is no action item for now. However, we will need to keep in mind that any upgrade of the google-cloud-storage library may break our gcsio code, given that we currently depend on this private member variable.

I submitted a feature request to google-cloud-storage: googleapis/python-storage#1214 and will follow up with them.

@damccorm
Contributor

@shunping could you add a comment with a link to this issue alongside the dependency in https://github.com/apache/beam/blob/master/sdks/python/setup.py? I'd like to avoid us bumping it without understanding the context.

It's also worth noting that we're actually at risk today. It looks like we're pinned to google-cloud-storage>=2.14.0,<3, but this could break if they release 2.15.0 without the internal field (and someone's other dependencies caused 2.15.0 to get installed). I think we can probably live with that risk since the fix should be relatively straightforward (force a lower version of the dependency), but it is a risk nonetheless.
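Something like this would cover it (a sketch of the requested comment; the exact wording is up to whoever edits setup.py):

# sdks/python/setup.py (sketch)
REQUIRED_PACKAGES = [
    # gcsio reads the private google.cloud.storage.batch.Batch._responses
    # field, so a minor release that removes or renames it could break us.
    # See https://github.com/apache/beam/issues/27326 before changing this bound.
    'google-cloud-storage>=2.14.0,<3',
]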

@tvalentyn
Contributor

this could break if they release 2.15.0 without the internal field

In this scenario, will the pipeline fail at runtime or at job submission?

@tvalentyn
Contributor

tvalentyn commented Jan 23, 2024

Note that at runtime, an older version of the cloud-storage client will likely be preinstalled in SDK containers. But if a pipeline would fail at job submission, and we know it will happen and they won't budge to keep the new version backwards-compatible for us, then it would be better to add a tighter upper bound.

@damccorm
Contributor

they won't budge to keep the new version backwards-compatible for us

I don't think this is necessarily true; my takeaway from the above conversation is that it is fairly unlikely for them to break us with a minor version update.

In this scenario, will the pipeline fail at runtime or at job submission?

At runtime. So I think the only scenario where we'd run into issues is if a user has an extra package (or custom container) that requires >=2.15; since we couldn't satisfy that requirement with the lower version, pip would likely upgrade the package.

Again, I think all of this is pretty low likelihood of happening though.

@shunping
Contributor

I have asked the support team to keep me updated on any upcoming changes to the batch module.

@damccorm damccorm added the done & done Issue has been reviewed after it was closed for verification, followups, etc. label Jan 30, 2024