Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sources(curl): download multiple URLs with the same curl command #1573

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

mvo5
Copy link
Contributor

@mvo5 mvo5 commented Feb 6, 2024

This PR is based on the excellent work of @supakeen in #1549 and the findings in #1023

Simon found that curl has a lot of fork/exec overhead in the way we use it. The other issue with our current use of curl is that we do not reuse connections and the overhead of the connection setup (especially for https) is there.

Newer versions of curl (7.68+) have a nice new feature called --parallel that will reuse connections and download 50 streams in parallel by default (this can be controlled via the --parallel-max option). Since we want some more features (like --write-out %{json}) we actually need curl 7.75+ to use --parallel in a nice way. But that is okay because RHEL9/Centos9 have 7.76 as their default (but of course I can bring it back if people feel strongly about this).

Note that this PR originally also parallelized the old forking() curl but I removed this because it seems like a lot of extra work just for RHEL8 which is (AIUI) in maintenance mode.

The numbers I see are relatively modest for me though, for me it's mostly I/O bound as my connections is not that fast maybe someone with a fast connection can also test this? I ran:

$ rm -rf .osbuild/ ; time python3 -m osbuild  --libdir . ./test/data/manifests/fedora-container.json 
...
real	0m28.433s
user	0m6.179s
sys	0m4.821s


[upstream/main at v117-13-gaf0e8490]
$ rm -rf .osbuild/ ; time python3 -m osbuild  --libdir . ./test/data/manifests/fedora-container.json 
...
real	0m32.849s
user	0m17.607s
sys	0m6.236s
`` 
and see broadly the same total time except that the "user" time is significantly lower.

[Build on top of https://github.com/osbuild/osbuild/pull/1767 - once that is merged this one will be a lot shorter]

Copy link
Member

@ondrejbudai ondrejbudai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a brief review and this looks awesome, thanks! Less forking and doing stuff in parallel seems awesome. :)

Can you please improve the commit message for "curl: add _split_chksum_desc_tuples"? It's basically the most important commit in the series, but the commit message doesn't reflect that. :)

sources/org.osbuild.curl Outdated Show resolved Hide resolved
@supakeen
Copy link
Member

Even modest improvement is improvement. Can you rebase this and fix tox.ini so I can approve? :)

@mvo5 mvo5 marked this pull request as ready for review February 27, 2024 17:27
@mvo5 mvo5 force-pushed the curl-next3 branch 4 times, most recently from 2e1ec3d to 249252f Compare February 28, 2024 17:08
sources/org.osbuild.curl Outdated Show resolved Hide resolved
sources/org.osbuild.curl Outdated Show resolved Hide resolved
sources/org.osbuild.curl Outdated Show resolved Hide resolved
mvo5 added a commit to mvo5/osbuild that referenced this pull request Apr 3, 2024
When moving to parallel downloads in curl we will need more
comprehensive tests. This commits starts with building some
infrastructure for this.
mvo5 added a commit to mvo5/osbuild that referenced this pull request Apr 5, 2024
When moving to parallel downloads in curl we will need more
comprehensive tests. This commits starts with building some
infrastructure for this.
mvo5 added a commit that referenced this pull request Apr 5, 2024
When moving to parallel downloads in curl we will need more
comprehensive tests. This commits starts with building some
infrastructure for this.
Copy link

This PR is stale because it has been open 30 days with no activity. Remove "Stale" label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale label Apr 18, 2024
@bcl bcl removed the Stale label Apr 23, 2024
@bcl
Copy link
Contributor

bcl commented Apr 23, 2024

Remove the stale label for now, I still think this is a good idea. When there is time :)

sources/org.osbuild.curl Outdated Show resolved Hide resolved
@supakeen
Copy link
Member

supakeen commented Apr 30, 2024

Ran the timings for org.osbuild.curl from this PR and an (adjusted for the new interface) org.osbuild.httpx from #1023:

root@muja:/home/user# rm -rf .osbuild && time osbuild fedora-boot-curl-parallel.json > /dev/null

real    0m21.680s
user    0m4.264s
sys 0m2.330s
root@muja:/home/user# rm -rf .osbuild && time osbuild fedora-boot-httpx-aiofiles.json > /dev/null

real    0m14.057s
user    0m10.557s
sys 0m2.376s

Copy link
Member

@supakeen supakeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like; while it is useful for its performance improvements over the current curl stage, it seems to make the curl stage a lot more complex with all the setup, configuration files, and all these things as compared to an approach that doesn't go through a curl binary.

Would it perhaps be a good idea to try using libcurl bindings through pycurl? This is likely shipped in RHEL and gets rid of the subprocesses, configuration setup, and json parsing.

@mvo5
Copy link
Contributor Author

mvo5 commented Apr 30, 2024

I feel like; while it is useful for its performance improvements over the current curl stage, it seems to make the curl stage a lot more complex with all the setup, configuration files, and all these things as compared to an approach that doesn't go through a curl binary.

Would it perhaps be a good idea to try using libcurl bindings through pycurl? This is likely shipped in RHEL and gets rid of the subprocesses, configuration setup, and json parsing.

Thank you for the measurements above, this is great! Interesting that curl is still lacking behind, I wonder if:

index 8a45e3b6..3d8e207e 100755
--- a/sources/org.osbuild.curl
+++ b/sources/org.osbuild.curl
@@ -185,6 +185,8 @@ def fetch_many_new_curl(tmpdir, targetdir, dl_pairs):
         # this adds a bunch of noise but might be nice for debug?
         # "--show-error",
         "--parallel",
+        "--parallel-max", "100",
+        "--parallel-immediate",
         # this will write out a json record for each finished download
         "--write-out", "%{json}\n",
     ]

makes any difference?

As for libcurl/httpx - maybe I misunderstood, I was assuming one of the goal is to stay compatible with old manifests. They will only pull in curl (the binary) in the buildroot(AIUI?) so this is targeting this. If this is not a design constraint I'm happy to look into libcurl and see if I can make it less complex. I looked at this code here again but I don't see real wins here anymore (but happy to be proven wrong!), I think when using the curl binaries this is roughly the complexity we will get (less once we can drop the old code, there is a bit of duplication that I really don't like but that is hard to get rid of but maybe anyother look tomorrow brings me ideas).

@supakeen
Copy link
Member

supakeen commented Apr 30, 2024

Sources run outside the sandbox if I recall correctly. You are right that this would apply cleanly to old manifests as it improves the existing curl source while only new manifests could use differently named sources.

@ondrejbudai
Copy link
Member

Interesting, httpx is comparable to curl on my setup. 🤔 I think I'm still in favor of this PR, it will provide a very interesting self-contained speed improvement. Alternatively, we can focus on finishing #1304 which will bring support for handling mirror fallbacks properly.

old curl

7.72user 3.37system 1:26.58elapsed 12%CPU (0avgtext+0avgdata 30384maxresident)k
344inputs+0outputs (31major+542192minor)pagefaults 0swaps

new curl

1.68user 1.38system 0:44.46elapsed 6%CPU (0avgtext+0avgdata 30780maxresident)k
0inputs+0outputs (0major+12908minor)pagefaults 0swaps

new curl with tweaks (parallel-immediate)

1.97user 1.51system 0:47.14elapsed 7%CPU (0avgtext+0avgdata 38396maxresident)k
0inputs+0outputs (0major+17992minor)pagefaults 0swaps

httpx

8.82user 1.93system 0:51.83elapsed 20%CPU (0avgtext+0avgdata 82976maxresident)k
32inputs+1264outputs (0major+24259minor)pagefaults 0swaps

mvo5 added 4 commits May 2, 2024 09:08
Instead of passing the url and options on the commandline this
commit moves it into a config file. This is not useful just yet
but it will be once we download multiple urls per curl instance.
Modern curl (7.68+) has a --parallel option that will download
multiple sources in parallel. This commit adds detection for this
feature as it is only available after RHEL 8.

In addition we need some more feature to properly support --parallel,
i.e. `--write-out` with json and exitcode options. This bumps the
requirements to 7.75+ which is still fine, centos9/RHEL9 have
7.76.
When using a modern curl we can download download multiple urls
in parallel which avoids connection setup overhead and is generally
more efficient. Use when it's detected.

TODO: ensure both old and new curl are tested automatically via
the testsuite.
To ensure there are no regressions with the old curl make
sure to run all tests that fetch_all() with both old and
new curl.
@supakeen
Copy link
Member

supakeen commented May 2, 2024

Interesting, httpx is comparable to curl on my setup. 🤔 I think I'm still in favor of this PR, it will provide a very interesting self-contained speed improvement. Alternatively, we can focus on finishing #1304 which will bring support for handling mirror fallbacks properly.

old curl

7.72user 3.37system 1:26.58elapsed 12%CPU (0avgtext+0avgdata 30384maxresident)k 344inputs+0outputs (31major+542192minor)pagefaults 0swaps

new curl

1.68user 1.38system 0:44.46elapsed 6%CPU (0avgtext+0avgdata 30780maxresident)k 0inputs+0outputs (0major+12908minor)pagefaults 0swaps

new curl with tweaks (parallel-immediate)

1.97user 1.51system 0:47.14elapsed 7%CPU (0avgtext+0avgdata 38396maxresident)k 0inputs+0outputs (0major+17992minor)pagefaults 0swaps

httpx

8.82user 1.93system 0:51.83elapsed 20%CPU (0avgtext+0avgdata 82976maxresident)k 32inputs+1264outputs (0major+24259minor)pagefaults 0swaps

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

And yes, this PR is still a nice speed up.

We cannot use `curl --write-out %{json}` because older curl
(7.76 from RHEL9/Centos9) will write `{"http_connect":000}`
which python cannot parse.
@ondrejbudai
Copy link
Member

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

I'm not sure which manifest you used. I used the fedora-boot one. Not sure if that matters a lot, it's good to have multiple data points I think. :)

@supakeen
Copy link
Member

supakeen commented May 3, 2024

Is this on the same manifest? I'd assume the percentage overhead of what needs to be done for curl to be larger the faster the pipe becomes.

I'm not sure which manifest you used. I used the fedora-boot one. Not sure if that matters a lot, it's good to have multiple data points I think. :)

fedora-boot :)

@achilleas-k
Copy link
Member

They will only pull in curl (the binary) in the buildroot(AIUI?)

Sources run on the host before osbuild sets up a buildroot or bubblewrap containers.
Anything used by a source becomes a dependency of osbuild.

@supakeen
Copy link
Member

supakeen commented May 6, 2024

What if we add the ability for some pipelines to do networking so we can have a conceptual netroot as well ;)

/me runs away.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants