[supply] parallel uploads for meta per language #21474

Merged
2 changes: 1 addition & 1 deletion fastlane_core/lib/fastlane_core/queue_worker.rb
@@ -6,7 +6,7 @@ module FastlaneCore
# Use this when you have all the items that you'll process in advance.
# Simply enqueue them to this and call `QueueWorker#start`.
class QueueWorker
- NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [(ENV["DELIVER_NUMBER_OF_THREADS"] || ENV.fetch("FL_NUMBER_OF_THREADS", 10)).to_i, 10].min
+ NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [(ENV["DELIVER_NUMBER_OF_THREADS"] || ENV.fetch("FL_NUMBER_OF_THREADS", 10)).to_i, 1].max
Contributor:

I'm not sure about this change 🤔

I think the intention of the initial code was to use FL_NUMBER_OF_THREADS if defined, but to cap that value at 10 so we never risk using more than 10 threads, while your change allows an unlimited number of threads. Keeping a maximum parallelism is recommended, because adding more threads is not always faster: the OS thread scheduler has to constantly switch between thread contexts, so there's a sweet spot / critical number beyond which adding more threads is actually detrimental rather than beneficial.

Also, given the behavior of ENV.fetch(key, default), the expression used as the first item of the array will be 10 if all the tested env vars are nil. So the 1 used as the second item of the array isn't really useful here, because it will never be the value picked.
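
A quick way to see that default path, in plain Ruby (nothing fastlane-specific here; both env vars assumed unset):

ENV["DELIVER_NUMBER_OF_THREADS"]        # => nil (unset)
ENV.fetch("FL_NUMBER_OF_THREADS", 10)   # => 10 (fetch falls back to the default)
(ENV["DELIVER_NUMBER_OF_THREADS"] || ENV.fetch("FL_NUMBER_OF_THREADS", 10)).to_i
# => 10, so [10, 1].max can only ever return 10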


I think the behavior we want is:

  • If either DELIVER_NUMBER_OF_THREADS or FL_NUMBER_OF_THREADS is defined, use that value, but up to a maximum value of 10 (i.e. if those env vars contain a number ≤10, use it, but if the value is >10, cap it to 10 instead)
  • If none of the env vars are defined, use a default value of [10.to_i, 10].min = 10

At least that's what the initial implementation was doing, while after your changes this leads to:

  • If either DELIVER_NUMBER_OF_THREADS or FL_NUMBER_OF_THREADS is defined, use that value as long as it's greater than 1 ([value.to_i, 1].max)
  • If none of the env vars are defined, use a default value of [10.to_i, 1].max = 10

The only case that might not be covered by the original implementation is if one of the env vars is defined… but with a value that is 0 or not a number (e.g. aaa). In that case .to_i would return 0, so ['aaa'.to_i, 10].min would also return 0. That is indeed worth fixing (by using the default instead, as if the env var was not defined), but we still need an upper bound of 10 to avoid thread-bombing.
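
A quick illustration of that edge case, again in plain Ruby:

'aaa'.to_i            # => 0 (non-numeric strings convert to zero)
['aaa'.to_i, 10].min  # => 0, i.e. the original code would end up with zero threads
['aaa'.to_i, 1].max   # => 1, i.e. the PR's fix avoids the zero but loses the upper bound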


Maybe the following (i.e. finding the first positive value after converting all candidates with to_i, then clamping the result between 1 and 10) would make more sense to cover all cases and better match the behavior we ultimately want? wdyt?

Suggested change
NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [(ENV["DELIVER_NUMBER_OF_THREADS"] || ENV.fetch("FL_NUMBER_OF_THREADS", 10)).to_i, 1].max
NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [ENV["DELIVER_NUMBER_OF_THREADS"], ENV["FL_NUMBER_OF_THREADS"], 10].map(&:to_i).find(&:positive?).clamp(1, 10)

Contributor Author:

I was confused as well about the intention of this line and how we set the number of threads 😅, so I initially thought it would be best to make a change that stays close to the original behaviour.

Capping the maximum number of threads spawned by fastlane sounds like a sensible choice and I agree we should do it (we don't know what users might do later 😄). But I would also need a way to "unlock" that maximum cap for the reasons I stated above (which mostly applies to users who understand the consequences), so how about adding another env var called FL_MAX_NUMBER_OF_THREADS with a default value of 10? Improving on your suggestion, it would look like this:

Suggested change
NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [(ENV["DELIVER_NUMBER_OF_THREADS"] || ENV.fetch("FL_NUMBER_OF_THREADS", 10)).to_i, 1].max
NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : [ENV["DELIVER_NUMBER_OF_THREADS"], ENV["FL_NUMBER_OF_THREADS"], 10].map(&:to_i).find(&:positive?).clamp(1, ENV.fetch("FL_MAX_NUMBER_OF_THREADS", 10).to_i)
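
For illustration, here is how that proposed expression would evaluate under a hypothetical environment (the env var values below are made up for the example):

# Hypothetical setup: a power user opting into more parallelism.
ENV["FL_NUMBER_OF_THREADS"] = "24"
ENV["FL_MAX_NUMBER_OF_THREADS"] = "24"

[ENV["DELIVER_NUMBER_OF_THREADS"], ENV["FL_NUMBER_OF_THREADS"], 10]
  .map(&:to_i)       # => [0, 24, 10] (nil.to_i is 0)
  .find(&:positive?) # => 24, the first positive candidate
  .clamp(1, ENV.fetch("FL_MAX_NUMBER_OF_THREADS", 10).to_i)
# => 24; with FL_MAX_NUMBER_OF_THREADS unset, the same input would be clamped to 10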

Contributor:

Yeah sounds like a great balance.


# @param concurrency (Numeric) - A number of threads to be created
# @param block (Proc) - A task you want to execute with enqueued items
35 changes: 24 additions & 11 deletions supply/lib/supply/uploader.rb
@@ -1,6 +1,10 @@
require 'fastlane_core'

module Supply
# rubocop:disable Metrics/ClassLength
class Uploader
UploadJob = Struct.new(:language, :version_code, :release_notes)

def perform_upload
FastlaneCore::PrintTable.print_values(config: Supply.config, hide_keys: [:issuer], mask_keys: [:json_key_data], title: "Summary for supply #{Fastlane::VERSION}")

@@ -81,6 +85,21 @@ def perform_upload_meta(version_codes, track_name)
version_codes.to_s == ""
end

upload_worker = FastlaneCore::QueueWorker.new do |job|
begin
Contributor:

I think begin/end is unnecessary when it encapsulates the whole body of a block? i.e. I think you can remove begin and end while still keeping your rescue here?
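
For reference, a minimal sketch of that idiom (the helper name is a placeholder, not from this diff); since Ruby 2.6, a do…end block body accepts rescue directly, without a begin/end wrapper:

upload_worker = FastlaneCore::QueueWorker.new do |job|
  perform_all_uploads_for(job) # hypothetical stand-in for the upload calls
rescue => error
  UI.error("#{job.language} - #{error}")
end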

Contributor Author:

good catch! will change

UI.message("Preparing uploads for language '#{job.language}'...")
start_time = Time.now
listing = client.listing_for_language(job.language)
upload_metadata(job.language, listing) unless Supply.config[:skip_upload_metadata]
upload_images(job.language) unless Supply.config[:skip_upload_images]
upload_screenshots(job.language) unless Supply.config[:skip_upload_screenshots]
job.release_notes << upload_changelog(job.language, job.version_code) unless Supply.config[:skip_upload_changelogs]
Contributor:

I am not 100% sure this really works well in a multi-threaded environment? Especially since Ruby mostly works with values, not references like the inout parameter concept found in some other languages.

Basically you initialize the local variable release_notes = [] on line 110, then pass the value of that var when building the UploadJob instance that you enqueue. Then for each job you append a new value to job.release_notes… but is that the same reference as the initial release_notes variable you created on line 110, as opposed to a passed-by-value copy?? 🤔

(Also, even if it is passed by reference, what happens if multiple UploadJobs finish, potentially out of order, and try to append to that array concurrently? 🤔 💭)


Maybe this works in practice though, because I think that in some way Ruby Arrays "act as references"? So adding a value to an array you passed as a parameter will add to the original (not to a copy of a passed-by-value array)… 🤔 But even if it does, I'm not sure it feels natural to rely on this.

Instead, wdyt about:

  • Removing :release_notes field from your UploadJob struct
  • Declaring release_notes = [] before you declare your upload_worker and its block (so that the variable is declared and can be captured by the block declaration)
  • Inside your block, use release_notes << upload_changelog(…) to directly append to the outer, captured variable

This approach would be very similar to the example given in the official doc of Thread::new 🙃
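
A small standalone sketch (plain Ruby, not code from this PR) of both points: the Struct member and the local variable reference the same Array object, and a Mutex makes concurrent appends safe:

Job = Struct.new(:language, :release_notes)

shared = []
job = Job.new("en-US", shared)
job.release_notes << "notes"
shared.equal?(job.release_notes) # => true, same object, not a copy
shared                           # => ["notes"]

# Captured-variable variant, with a Mutex guarding the append:
release_notes = []
mutex = Mutex.new
threads = %w[en-US fr-FR de-DE].map do |lang|
  Thread.new { mutex.synchronize { release_notes << "notes for #{lang}" } }
end
threads.each(&:join)
release_notes.size # => 3, regardless of the order the threads finished in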

Contributor Author:

Now that you've pointed it out, I think it sounds reasonable. IIRC, Ruby passes everything by reference except immutable values (like Integers), but just to be on the safe side we can move the declaration of release_notes above upload_worker and let the block capture it, as you suggested.


what happens if multiple UploadJobs are finishing potentially out of order and try to append to that array concurrently?

I don't think the order of the results matters that much since in upload_changelogs there's nothing that would depend on it 👀

UI.message("Uploaded all items for language '#{job.language}'... (#{Time.now - start_time} secs)")
rescue => error
UI.error("#{job.language} - #{error}")
Contributor:

I wonder if we should exit (e.g. UI.abort_with_message!) in case of an error on any language, instead of just printing a red message and still continuing with the rest.

Personally, if something went wrong with one of the uploads (while some others passed) I'd prefer not to have partially uploaded data in the Play Store (e.g. only half of my screenshots for a single language because upload_screenshots raised mid-upload for some reason) and to instead have the whole lane stop to give me a chance to fix and retry.

Otherwise the overall call to the upload_to_play_store action will print red messages to the logs but still ultimately succeed (because UI.error doesn't raise, unlike UI.abort_with_message!) while having only partial data in the Play Console. This could be even more problematic if you run that lane on CI (like we do in our setup at my company): your CI step would show up green, making you think that everything went well, while your end users actually have some screenshots, images, or release notes missing.

Contributor Author:

From my experience I would rather let supply upload all available assets and fix the missing ones later (with or without another fastlane call). The reason is that the team I'm working with (Marketing) is prone to human errors that invalidate the data we send to the Google Play Console (like a title that's too long, an invalid screenshot resolution, etc.), and they prefer manual fixes in the UI over an automated system 🤷🏻‍♂️

Also, if I'm not mistaken, I think this is the default behaviour of deliver as well (continue all uploads and stop with an error at the end) 🤔


that means your CI step will show up green making you think that everything went well, while ultimately your end users will have some screenshots or images or release notes missing.

Tested locally, and the exit code is 1 when there is an upload error, so we should be covered in CI when an error is thrown. Do you mean that when setting changes_not_sent_for_review to false and the track is production, the process will continue to submit the app to the store even if some uploads failed?

Contributor:

Ah, TIL that deliver will fail later if an error was encountered in one of those cases. As long as the final exit code is 1 when any error occurred during the upload, I'm OK with it.

Contributor Author:

I did a couple more tests and realized that any network or API errors will actually be logged and ignored during the upload process. For instance, the following error was thrown but the process exited with code 0.

[21:24:37]: fr-FR - Google Api Error: Unauthorized - Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.

The errors I saw in my past tests, which made me believe the process exited with a non-zero code, were actually thrown later by client.commit_current_edit.

In this case, I propose initializing a "has_errors" bool before starting the uploads, setting it to true when we catch an error, then checking its value after the uploads complete and using UI.abort_with_message! to tell the user to check the logs above for errors. How does that sound @AliSoftware?

Contributor:

For instance, the following error was thrown

I know it's not the topic of your comment, but FYI the "Google Api Error" you are seeing in your tests is likely unrelated to the changes in your PR, and more likely related to a current issue apparently happening randomly for everyone with the Google API and fastlane; I myself have been having this issue all day even with fastlane 2.214.0, while it worked a couple of days ago.

In this case, I propose initializing a "has_errors" bool before starting the uploads, setting it to true when we catch an error, then checking its value after the uploads complete and using UI.abort_with_message! to tell the user to check the logs above for errors. How does that sound @AliSoftware?

Just to be clear, you'd call UI.abort_with_message! only after having called client.commit_current_edit, which means the action would exit with a non-zero code… but the changes would still have been submitted to Google Play, right?

I can see why you'd want that, but what I don't like about this approach is that it differs from how things behaved before. In the current version of fastlane, if one upload fails, the action stops: it neither tries to upload the remaining images nor commits the edit to submit the items that passed while skipping the ones that failed. So your suggestion would change that behavior.


Given that, and also given that I was considering implementing some auto-retry mechanism for the current unrelated issue we're having with the Google API, I'd suggest to:

  • Make this current PR keep the same behavior on failure as what we currently have without parallel uploads. That is, if one upload fails, stop everything without trying to upload the rest
  • Separately, make a PR to add ConfigItem.new(key: :allow_submit_partial_changes) (happy if you have a better name idea for that one 😄) that would allow the behavior you're after (i.e. allowing the screenshots that passed upload to be submitted even if other screenshots failed). But that would be an opt-in option, so that everyone could choose the behavior they prefer in case of failures (you'd probably turn it on while I'd personally keep it off), and also so that we don't break backwards compatibility in the action's behavior.
  • Separately again, I was thinking of making a PR to add some ConfigItem.new(key: :max_retries) to the upload_to_play_store action too, to allow retrying each API call a maximum number of times if it fails. That should already reduce the risk of a whole batch of screenshot uploads failing mid-process, and would also help mitigate Google Api Error: Unauthorized - Request is missing required authentication credential when trying to do a Google Play release #21507. A rough sketch of both options follows this list.
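
A rough sketch of what those two options could look like (the key names come from this discussion; the env names, descriptions, and defaults are hypothetical, not shipped fastlane API):

FastlaneCore::ConfigItem.new(key: :allow_submit_partial_changes,
                             env_name: "SUPPLY_ALLOW_SUBMIT_PARTIAL_CHANGES", # hypothetical
                             description: "Commit the edit even if some uploads failed (opt-in)",
                             optional: true,
                             default_value: false,
                             is_string: false) # boolean flag
FastlaneCore::ConfigItem.new(key: :max_retries,
                             env_name: "SUPPLY_MAX_RETRIES", # hypothetical
                             description: "Maximum number of retries for each Google API call",
                             optional: true,
                             default_value: 0,
                             type: Integer)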

Contributor Author (@ccaruceru, Sep 15, 2023):

Thanks for the heads-up about the Google API error, I thought it was just me 😄

Just to be clear, you'd call UI.abort_with_message! only after having called client.commit_current_edit, which means the action would exit with a non-zero code… but the changes would still have been submitted to Google Play, right?

Actually no, what I was thinking of is adding an extra line immediately after starting the worker processes (line 115) to check for the error (something that looks like this):

upload_worker.start
UI.abort_with_message!("Some of the upload processes ended with failure! Check the logs above for more infromation.") if has_errors

Update: no, you were right. The error should actually be thrown after the Edit is committed, otherwise we uploaded the data for nothing.


But the last couple of points you made make sense to me now, so in this PR I'll focus on keeping the current behaviour, then do a follow-up in another PR with the "allow submit partial changes" functionality.

end
end

version_codes.each do |version_code|
UI.user_error!("Could not find folder #{metadata_path}") unless File.directory?(metadata_path)

@@ -89,17 +108,11 @@ def perform_upload_meta(version_codes, track_name)
UI.user_error!("Could not find release for version code '#{version_code}' to update changelog") unless release

release_notes = []
Contributor:

One difference in behavior before/after this change is that before, release_notes was reset to [] on each iteration of the version_codes loop, while now it's not, and is thus shared across all version_codes.

That might break the logic, and potentially mix up the release notes published when there are different version codes with different changelogs/*.txt files.

You might thus want to keep this reset of release_notes = [] inside the loop to cover this case.
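
i.e., roughly (a sketch of the suggested fix against the loop shown below in this diff):

version_codes.each do |version_code|
  release_notes = [] # reset per version code so changelogs don't leak between versions
  # ... enqueue the UploadJobs for this version_code and run the worker ...
  upload_changelogs(release_notes, release, track, track_name) unless release_notes.empty?
end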

Contributor Author:

Nicely observed 😄 I hadn't tested the multiple version codes scenario. Will fix.

all_languages.each do |language|
next if language.start_with?('.') # e.g. . or .. or hidden folders
UI.message("Preparing to upload for language '#{language}'...")

listing = client.listing_for_language(language)

upload_metadata(language, listing) unless Supply.config[:skip_upload_metadata]
upload_images(language) unless Supply.config[:skip_upload_images]
upload_screenshots(language) unless Supply.config[:skip_upload_screenshots]
release_notes << upload_changelog(language, version_code) unless Supply.config[:skip_upload_changelogs]
end
upload_worker.batch_enqueue(
# skip . or .. or hidden folders
all_languages.reject { |lang| lang.start_with?('.') }.map { |lang| UploadJob.new(lang, version_code, release_notes) }
Contributor Author:

.reject behaves the same as the previously used next if

)
upload_worker.start

upload_changelogs(release_notes, release, track, track_name) unless release_notes.empty?
Contributor:

When reviewing this I initially wondered whether upload_worker.start was guaranteed to block, i.e. to wait for all jobs in the queue to finish before continuing to the next line. From what I see in the implementation of FastlaneCore::QueueWorker#start, there's a threads.each(&:join) at the end of that method, so we should be good after all (whereas if that wasn't the case, upload_changelogs(release_notes, …) would run too soon…)

I'm telling you this because I feel like this PR would definitely benefit from some additional unit tests (aka specs) to help us guarantee the behavior. Especially given how tricky multi-threaded code can sometimes be, with hidden edge cases and all, it's better to get some additional assurance 🙂

Of course, the fact that NUMBER_OF_THREADS = FastlaneCore::Helper.test? ? 1 : … makes the code run basically serially (mono-threaded) rather than in parallel in the context of tests won't really let us write tests ensuring that parallel execution has no major side effects we haven't thought about in case some actions are done out of order. But it could still help catch some other side effects (like the release_notes case), so this would still be valuable.
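
As a small sanity check, here is a standalone sketch using only the QueueWorker API visible in this diff (new with a block, batch_enqueue, start), showing that start only returns once the queue is drained:

require 'fastlane_core'

results = Queue.new # Ruby's built-in thread-safe queue, used as a collector
worker = FastlaneCore::QueueWorker.new { |item| results << item * 2 }
worker.batch_enqueue((1..20).to_a)
worker.start # joins all worker threads, so every job is done once this returns

collected = []
collected << results.pop until results.empty?
collected.size # => 20, nothing past start runs until all jobs have completed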

Contributor Author:

Writing tests is always a good idea. I'll have a look at your suggestions above and take some time this week to make that happen 🧪

end