sources(curl): download multiple URLs with the same curl command #1573
Conversation
I did a brief review and this looks awesome, thanks! Less forking and doing stuff in parallel seems awesome. :)
Can you please improve the commit message for "curl: add _split_chksum_desc_tuples"? It's basically the most important commit in the series, but the commit message doesn't reflect that. :)
Even modest improvement is improvement. Can you rebase this and fix …
(force-pushed from 2e1ec3d to 249252f)
When moving to parallel downloads in curl we will need more comprehensive tests. This commit starts with building some infrastructure for this.
This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.
Removing the stale label for now; I still think this is a good idea. When there is time :)
Ran the timings for …
I feel like, while it is useful for its performance improvements over the current curl stage, this makes the curl stage a lot more complex, with all the setup, configuration files, and so on, compared to an approach that doesn't go through a curl binary.

Would it perhaps be a good idea to try using libcurl bindings through pycurl? That is likely shipped in RHEL and gets rid of the subprocesses, the configuration setup, and the JSON parsing.
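For illustration, a rough sketch of what the suggested pycurl approach could look like; this is not part of the PR, and the download-list shape, checksum verification, and error handling are all assumed or omitted:

```python
import pycurl

def download_all(url_path_pairs):
    # One CurlMulti drives all transfers in a single process,
    # reusing connections much like `curl --parallel` would.
    multi = pycurl.CurlMulti()
    handles = []
    for url, path in url_path_pairs:
        c = pycurl.Curl()
        c.fp = open(path, "wb")  # keep the file object alive on the handle
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        multi.add_handle(c)
        handles.append(c)

    num_active = len(handles)
    while num_active:
        # Drive transfers until libcurl wants us to wait for socket activity.
        while True:
            ret, num_active = multi.perform()
            if ret != pycurl.E_CALL_MULTI_PERFORM:
                break
        multi.select(1.0)

    for c in handles:
        multi.remove_handle(c)
        c.fp.close()
        c.close()

# e.g. download_all([("https://example.com/a.rpm", "a.rpm")])
```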
Thank you for the measurements above, this is great! Interesting that curl is still lagging behind, I wonder if:

```diff
index 8a45e3b6..3d8e207e 100755
--- a/sources/org.osbuild.curl
+++ b/sources/org.osbuild.curl
@@ -185,6 +185,8 @@ def fetch_many_new_curl(tmpdir, targetdir, dl_pairs):
         # this adds a bunch of noise but might be nice for debug?
         # "--show-error",
         "--parallel",
+        "--parallel-max", "100",
+        "--parallel-immediate",
         # this will write out a json record for each finished download
         "--write-out", "%{json}\n",
     ]
```

makes any difference?

As for libcurl/httpx: maybe I misunderstood, but I was assuming one of the goals is to stay compatible with old manifests. They will only pull in curl (the binary) in the buildroot (AIUI?), so this is targeting that. If this is not a design constraint I'm happy to look into libcurl and see if I can make it less complex.

I looked at this code again but I don't see real wins here anymore (happy to be proven wrong!). I think when using the curl binary this is roughly the complexity we will get (less once we can drop the old code; there is a bit of duplication that I really don't like and that is hard to get rid of, but maybe another look tomorrow brings new ideas).
Sources run outside the sandbox if I recall correctly. You are right that this would apply cleanly to old manifests, as it improves the existing curl source, while only new manifests could use differently named sources.
Interesting:

```
old curl:                                   7.72user 3.37system 1:26.58elapsed 12%CPU (0avgtext+0avgdata 30384maxresident)k
new curl:                                   1.68user 1.38system 0:44.46elapsed  6%CPU (0avgtext+0avgdata 30780maxresident)k
new curl with tweaks (parallel-immediate):  1.97user 1.51system 0:47.14elapsed  7%CPU (0avgtext+0avgdata 38396maxresident)k
httpx:                                      8.82user 1.93system 0:51.83elapsed 20%CPU (0avgtext+0avgdata 82976maxresident)k
```
Instead of passing the URL and options on the command line, this commit moves them into a config file. This is not useful just yet, but it will be once we download multiple URLs per curl instance.
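As a rough illustration of the idea (the `dl_pairs` shape and function name here are assumptions, not the PR's actual code): curl's `--config` syntax takes one long option per line, so the per-download options can be written to a temporary file and handed to a single invocation.

```python
import subprocess
import tempfile

def fetch_via_config(dl_pairs):
    # dl_pairs: [(checksum, url), ...] -- an assumed shape, for illustration.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".curl.cfg") as cfg:
        for checksum, url in dl_pairs:
            # curl config-file syntax: long option names, one per line
            cfg.write(f'url = "{url}"\n')
            cfg.write(f'output = "{checksum}"\n')
        cfg.flush()
        # a single curl process reads every URL and option from the file
        subprocess.run(["curl", "--config", cfg.name], check=True)
```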
Modern curl (7.68+) has a --parallel option that will download multiple sources in parallel. This commit adds detection for this feature, as it is only available in versions newer than what RHEL 8 ships. In addition we need some more features to properly support --parallel, i.e. `--write-out` with JSON and exit-code options. This bumps the requirement to 7.75+, which is still fine: CentOS 9/RHEL 9 have 7.76.
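A minimal sketch of such feature detection, parsing the version out of `curl --version` (the function name is made up, not the PR's actual helper):

```python
import re
import subprocess

def curl_supports_parallel(minimum=(7, 75)):
    # `curl --version` output starts with e.g.
    # "curl 7.76.1 (x86_64-redhat-linux-gnu) libcurl/7.76.1 ..."
    out = subprocess.check_output(["curl", "--version"], encoding="utf-8")
    match = re.match(r"curl (\d+)\.(\d+)", out)
    return bool(match) and (int(match[1]), int(match[2])) >= minimum
```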
When using a modern curl we can download multiple URLs in parallel, which avoids connection setup overhead and is generally more efficient. Use it when it is detected. TODO: ensure both old and new curl are tested automatically via the testsuite.
To ensure there are no regressions with the old curl, make sure to run all tests that use fetch_all() with both old and new curl.
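One way to express that in pytest could look like this; the module and function names are hypothetical stand-ins for whatever the PR actually uses:

```python
import pytest

@pytest.fixture(params=["old-curl", "new-curl"])
def curl_variant(request, monkeypatch):
    import curl_source  # hypothetical name for the curl source module
    # Force the feature probe so both code paths get exercised.
    monkeypatch.setattr(curl_source, "curl_supports_parallel",
                        lambda: request.param == "new-curl")
    return request.param

def test_fetch_all(curl_variant, tmp_path):
    import curl_source
    curl_source.fetch_all(tmp_path, [])  # runs once per fixture param
```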
Is this on the same manifest? I'd assume the percentage overhead of what curl has to do grows as the pipe gets faster. And yes, this PR is still a nice speedup.
We cannot use `curl --write-out %{json}` because older curl (7.76 from RHEL9/CentOS9) will write `{"http_connect":000}`, which Python's json module cannot parse.
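A standalone snippet (not from the PR) illustrating the failure:

```python
import json

line = '{"http_connect":000}'  # as emitted by curl 7.76 for unset fields
try:
    json.loads(line)
except json.JSONDecodeError as err:
    print(err)  # JSON numbers must not have leading zeros
```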
I'm not sure which manifest you used. I used the fedora-boot one. Not sure if that matters a lot; it's good to have multiple data points, I think. :)
fedora-boot :)
Sources run on the host before osbuild sets up a buildroot or bubblewrap containers. |
What if we add the ability for some pipelines to do networking so we can have a conceptual … /me runs away.
This PR is based on the excellent work of @supakeen in #1549 and the findings in #1023
Simon found that curl has a lot of fork/exec overhead in the way we use it. The other issue with our current use of curl is that we do not reuse connections, so we pay the connection setup overhead (especially for https) on every download.
Newer versions of curl (7.68+) have a nice new feature called `--parallel` that will reuse connections and download 50 streams in parallel by default (this can be controlled via the --parallel-max option). Since we want some more features (like `--write-out %{json}`) we actually need curl 7.75+ to use --parallel in a nice way. But that is okay, because RHEL9/CentOS9 have 7.76 as their default.

Note that this PR originally also parallelized the old forking curl code, but I removed that because it seems like a lot of extra work just for RHEL8, which is (AIUI) in maintenance mode (of course I can bring it back if people feel strongly about this).
The numbers I see are relatively modest, though; for me it's mostly I/O bound, as my connection is not that fast. Maybe someone with a fast connection can also test this? I ran: