sources(curl): download multiple URLs with the same curl command #1573

mvo5 · 2024-02-06T17:29:14Z

This PR is based on the excellent work of @supakeen in #1549 and the findings in #1023

Simon found that curl has a lot of fork/exec overhead in the way we use it. The other issue with our current use of curl is that we do not reuse connections and the overhead of the connection setup (especially for https) is there.

Newer versions of curl (7.68+) have a nice new feature called --parallel that will reuse connections and download 50 streams in parallel by default (this can be controlled via the --parallel-max option). Since we want some more features (like --write-out %{json}) we actually need curl 7.75+ to use --parallel in a nice way. But that is okay because RHEL9/Centos9 have 7.76 as their default (but of course I can bring it back if people feel strongly about this).

Note that this PR originally also parallelized the old forking() curl but I removed this because it seems like a lot of extra work just for RHEL8 which is (AIUI) in maintenance mode.

The numbers I see are relatively modest for me though, for me it's mostly I/O bound as my connections is not that fast maybe someone with a fast connection can also test this? I ran:

$ rm -rf .osbuild/ ; time python3 -m osbuild  --libdir . ./test/data/manifests/fedora-container.json 
...
real	0m28.433s
user	0m6.179s
sys	0m4.821s


[upstream/main at v117-13-gaf0e8490]
$ rm -rf .osbuild/ ; time python3 -m osbuild  --libdir . ./test/data/manifests/fedora-container.json 
...
real	0m32.849s
user	0m17.607s
sys	0m6.236s
`` 
and see broadly the same total time except that the "user" time is significantly lower.

[Build on top of https://github.com/osbuild/osbuild/pull/1767 - once that is merged this one will be a lot shorter]

ondrejbudai

I did a brief review and this looks awesome, thanks! Less forking and doing stuff in parallel seems awesome. :)

Can you please improve the commit message for "curl: add _split_chksum_desc_tuples"? It's basically the most important commit in the series, but the commit message doesn't reflect that. :)

sources/org.osbuild.curl

supakeen · 2024-02-25T08:30:51Z

Even modest improvement is improvement. Can you rebase this and fix tox.ini so I can approve? :)

sources/org.osbuild.curl

When moving to parallel downloads in curl we will need more comprehensive tests. This commits starts with building some infrastructure for this.

github-actions · 2024-04-18T04:10:28Z

This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.

bcl · 2024-04-23T23:53:39Z

Remove the stale label for now, I still think this is a good idea. When there is time :)

sources/org.osbuild.curl

supakeen · 2024-04-30T12:08:50Z

Ran the timings for org.osbuild.curl from this PR and an (adjusted for the new interface) org.osbuild.httpx from #1023:

root@muja:/home/user# rm -rf .osbuild && time osbuild fedora-boot-curl-parallel.json > /dev/null

real    0m21.680s
user    0m4.264s
sys 0m2.330s

root@muja:/home/user# rm -rf .osbuild && time osbuild fedora-boot-httpx-aiofiles.json > /dev/null

real    0m14.057s
user    0m10.557s
sys 0m2.376s

supakeen

I feel like; while it is useful for its performance improvements over the current curl stage, it seems to make the curl stage a lot more complex with all the setup, configuration files, and all these things as compared to an approach that doesn't go through a curl binary.

Would it perhaps be a good idea to try using libcurl bindings through pycurl? This is likely shipped in RHEL and gets rid of the subprocesses, configuration setup, and json parsing.

mvo5 · 2024-04-30T13:52:25Z

I feel like; while it is useful for its performance improvements over the current curl stage, it seems to make the curl stage a lot more complex with all the setup, configuration files, and all these things as compared to an approach that doesn't go through a curl binary.

Would it perhaps be a good idea to try using libcurl bindings through pycurl? This is likely shipped in RHEL and gets rid of the subprocesses, configuration setup, and json parsing.

Thank you for the measurements above, this is great! Interesting that curl is still lacking behind, I wonder if:

index 8a45e3b6..3d8e207e 100755
--- a/sources/org.osbuild.curl
+++ b/sources/org.osbuild.curl
@@ -185,6 +185,8 @@ def fetch_many_new_curl(tmpdir, targetdir, dl_pairs):
         # this adds a bunch of noise but might be nice for debug?
         # "--show-error",
         "--parallel",
+        "--parallel-max", "100",
+        "--parallel-immediate",
         # this will write out a json record for each finished download
         "--write-out", "%{json}\n",
     ]

makes any difference?

As for libcurl/httpx - maybe I misunderstood, I was assuming one of the goal is to stay compatible with old manifests. They will only pull in curl (the binary) in the buildroot(AIUI?) so this is targeting this. If this is not a design constraint I'm happy to look into libcurl and see if I can make it less complex. I looked at this code here again but I don't see real wins here anymore (but happy to be proven wrong!), I think when using the curl binaries this is roughly the complexity we will get (less once we can drop the old code, there is a bit of duplication that I really don't like but that is hard to get rid of but maybe anyother look tomorrow brings me ideas).

supakeen · 2024-04-30T14:26:19Z

Sources run outside the sandbox if I recall correctly. You are right that this would apply cleanly to old manifests as it improves the existing curl source while only new manifests could use differently named sources.

ondrejbudai · 2024-04-30T15:50:07Z

Interesting, httpx is comparable to curl on my setup. 🤔 I think I'm still in favor of this PR, it will provide a very interesting self-contained speed improvement. Alternatively, we can focus on finishing #1304 which will bring support for handling mirror fallbacks properly.

old curl

7.72user 3.37system 1:26.58elapsed 12%CPU (0avgtext+0avgdata 30384maxresident)k
344inputs+0outputs (31major+542192minor)pagefaults 0swaps

new curl

1.68user 1.38system 0:44.46elapsed 6%CPU (0avgtext+0avgdata 30780maxresident)k
0inputs+0outputs (0major+12908minor)pagefaults 0swaps

new curl with tweaks (parallel-immediate)

1.97user 1.51system 0:47.14elapsed 7%CPU (0avgtext+0avgdata 38396maxresident)k
0inputs+0outputs (0major+17992minor)pagefaults 0swaps

httpx

8.82user 1.93system 0:51.83elapsed 20%CPU (0avgtext+0avgdata 82976maxresident)k
32inputs+1264outputs (0major+24259minor)pagefaults 0swaps

Instead of passing the url and options on the commandline this commit moves it into a config file. This is not useful just yet but it will be once we download multiple urls per curl instance.

Modern curl (7.68+) has a --parallel option that will download multiple sources in parallel. This commit adds detection for this feature as it is only available after RHEL 8. In addition we need some more feature to properly support --parallel, i.e. `--write-out` with json and exitcode options. This bumps the requirements to 7.75+ which is still fine, centos9/RHEL9 have 7.76.

When using a modern curl we can download download multiple urls in parallel which avoids connection setup overhead and is generally more efficient. Use when it's detected. TODO: ensure both old and new curl are tested automatically via the testsuite.

To ensure there are no regressions with the old curl make sure to run all tests that fetch_all() with both old and new curl.

supakeen · 2024-05-02T09:38:40Z

Interesting, httpx is comparable to curl on my setup. 🤔 I think I'm still in favor of this PR, it will provide a very interesting self-contained speed improvement. Alternatively, we can focus on finishing #1304 which will bring support for handling mirror fallbacks properly.

old curl

7.72user 3.37system 1:26.58elapsed 12%CPU (0avgtext+0avgdata 30384maxresident)k 344inputs+0outputs (31major+542192minor)pagefaults 0swaps

new curl

1.68user 1.38system 0:44.46elapsed 6%CPU (0avgtext+0avgdata 30780maxresident)k 0inputs+0outputs (0major+12908minor)pagefaults 0swaps

new curl with tweaks (parallel-immediate)

1.97user 1.51system 0:47.14elapsed 7%CPU (0avgtext+0avgdata 38396maxresident)k 0inputs+0outputs (0major+17992minor)pagefaults 0swaps

httpx

8.82user 1.93system 0:51.83elapsed 20%CPU (0avgtext+0avgdata 82976maxresident)k 32inputs+1264outputs (0major+24259minor)pagefaults 0swaps

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

And yes, this PR is still a nice speed up.

We cannot use `curl --write-out %{json}` because older curl (7.76 from RHEL9/Centos9) will write `{"http_connect":000}` which python cannot parse.

ondrejbudai · 2024-05-02T11:04:44Z

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

I'm not sure which manifest you used. I used the fedora-boot one. Not sure if that matters a lot, it's good to have multiple data points I think. :)

supakeen · 2024-05-03T07:36:53Z

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

I'm not sure which manifest you used. I used the fedora-boot one. Not sure if that matters a lot, it's good to have multiple data points I think. :)

fedora-boot :)

achilleas-k · 2024-05-06T12:23:38Z

They will only pull in curl (the binary) in the buildroot(AIUI?)

Sources run on the host before osbuild sets up a buildroot or bubblewrap containers.
Anything used by a source becomes a dependency of osbuild.

supakeen · 2024-05-06T13:14:47Z

What if we add the ability for some pipelines to do networking so we can have a conceptual netroot as well ;)

/me runs away.

supakeen self-requested a review February 7, 2024 13:50

mvo5 force-pushed the curl-next3 branch from 48f7470 to 3db2d1a Compare February 7, 2024 14:10

ondrejbudai reviewed Feb 8, 2024

View reviewed changes

sources/org.osbuild.curl Outdated Show resolved Hide resolved

mvo5 force-pushed the curl-next3 branch from 3db2d1a to 57a4503 Compare February 8, 2024 15:50

mvo5 mentioned this pull request Feb 9, 2024

sources,util: add containers-storage source #1550

Merged

mvo5 marked this pull request as ready for review February 27, 2024 17:27

mvo5 force-pushed the curl-next3 branch 4 times, most recently from 2e1ec3d to 249252f Compare February 28, 2024 17:08

bcl reviewed Feb 28, 2024

View reviewed changes

sources/org.osbuild.curl Outdated Show resolved Hide resolved

bcl reviewed Feb 28, 2024

View reviewed changes

sources/org.osbuild.curl Outdated Show resolved Hide resolved

bcl reviewed Feb 28, 2024

View reviewed changes

sources/org.osbuild.curl Outdated Show resolved Hide resolved

mvo5 force-pushed the curl-next3 branch from e385c90 to 70b05ff Compare March 18, 2024 12:39

mvo5 mentioned this pull request Mar 18, 2024

sources: transform() is only used in the curl sources, remove from ABC and rename #1667

Merged

mvo5 added a commit to mvo5/osbuild that referenced this pull request Apr 3, 2024

source: add curl test in preparation for osbuild#1573

282c956

When moving to parallel downloads in curl we will need more comprehensive tests. This commits starts with building some infrastructure for this.

mvo5 added a commit to mvo5/osbuild that referenced this pull request Apr 5, 2024

source: add curl test in preparation for osbuild#1573

1843358

When moving to parallel downloads in curl we will need more comprehensive tests. This commits starts with building some infrastructure for this.

mvo5 added a commit that referenced this pull request Apr 5, 2024

source: add curl test in preparation for #1573

98f5904

When moving to parallel downloads in curl we will need more comprehensive tests. This commits starts with building some infrastructure for this.

github-actions bot added the Stale label Apr 18, 2024

bcl removed the Stale label Apr 23, 2024

mvo5 mentioned this pull request Apr 30, 2024

curl: add gen_curl_download_config() and use in download #1767

Closed

mvo5 force-pushed the curl-next3 branch from 70b05ff to d01debe Compare April 30, 2024 11:27

mvo5 commented Apr 30, 2024

View reviewed changes

sources/org.osbuild.curl Outdated Show resolved Hide resolved

mvo5 force-pushed the curl-next3 branch from d01debe to dd2f227 Compare April 30, 2024 11:56

supakeen reviewed Apr 30, 2024

View reviewed changes

mvo5 added 4 commits May 2, 2024 09:08

curl: add gen_curl_download_config() and use in download

348c545

Instead of passing the url and options on the commandline this commit moves it into a config file. This is not useful just yet but it will be once we download multiple urls per curl instance.

sources: run all tests for curl with both old and new curl

72a83bf

To ensure there are no regressions with the old curl make sure to run all tests that fetch_all() with both old and new curl.

mvo5 force-pushed the curl-next3 branch from 866600f to 72a83bf Compare May 2, 2024 07:48

fixup! include patch to manifest-db to get visiblity on gitlab failures

06f409f

mvo5 force-pushed the curl-next3 branch from ea1e776 to 06f409f Compare May 2, 2024 08:59

mvo5 mentioned this pull request May 2, 2024

tools: add logmonitor on stderr by default osbuild/manifest-db#123

Merged

sources: do not use %{json} when generating curl output

9b57e12

We cannot use `curl --write-out %{json}` because older curl (7.76 from RHEL9/Centos9) will write `{"http_connect":000}` which python cannot parse.

mvo5 force-pushed the curl-next3 branch from 96b2899 to 9b57e12 Compare May 2, 2024 10:26

mvo5 requested review from ondrejbudai and achilleas-k May 17, 2024 07:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sources(curl): download multiple URLs with the same curl command #1573

sources(curl): download multiple URLs with the same curl command #1573

mvo5 commented Feb 6, 2024 •

edited

ondrejbudai left a comment

supakeen commented Feb 25, 2024

github-actions bot commented Apr 18, 2024

bcl commented Apr 23, 2024

supakeen commented Apr 30, 2024 •

edited

supakeen left a comment •

edited

mvo5 commented Apr 30, 2024

supakeen commented Apr 30, 2024 •

edited

ondrejbudai commented Apr 30, 2024

supakeen commented May 2, 2024

old curl

new curl

new curl with tweaks (parallel-immediate)

httpx

ondrejbudai commented May 2, 2024

supakeen commented May 3, 2024

achilleas-k commented May 6, 2024

supakeen commented May 6, 2024

sources(curl): download multiple URLs with the same curl command #1573

Are you sure you want to change the base?

sources(curl): download multiple URLs with the same curl command #1573

Conversation

mvo5 commented Feb 6, 2024 • edited

ondrejbudai left a comment

Choose a reason for hiding this comment

supakeen commented Feb 25, 2024

github-actions bot commented Apr 18, 2024

bcl commented Apr 23, 2024

supakeen commented Apr 30, 2024 • edited

supakeen left a comment • edited

Choose a reason for hiding this comment

mvo5 commented Apr 30, 2024

supakeen commented Apr 30, 2024 • edited

ondrejbudai commented Apr 30, 2024

old curl

new curl

new curl with tweaks (parallel-immediate)

httpx

supakeen commented May 2, 2024

old curl

new curl

new curl with tweaks (parallel-immediate)

httpx

ondrejbudai commented May 2, 2024

supakeen commented May 3, 2024

achilleas-k commented May 6, 2024

supakeen commented May 6, 2024

mvo5 commented Feb 6, 2024 •

edited

supakeen commented Apr 30, 2024 •

edited

supakeen left a comment •

edited

supakeen commented Apr 30, 2024 •

edited