Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Earthly has overhead in simple cases #2049

Closed
vladaionescu opened this issue Jul 22, 2022 · 9 comments
Closed

Earthly has overhead in simple cases #2049

vladaionescu opened this issue Jul 22, 2022 · 9 comments
Assignees
Labels
type:performance Performance improvement request or issue

Comments

@vladaionescu
Copy link
Member

Example: https://gist.github.com/lynaghk/0f9b66cf889398a7c73b01f39beaffee

This takes 7 seconds when it's fully cached. (Toast, for example, takes 800ms in the same situation)

@vladaionescu vladaionescu added the type:performance Performance improvement request or issue label Jul 22, 2022
@alexcb
Copy link
Collaborator

alexcb commented Jul 22, 2022

VERSION 0.6

hello:
  FROM alpine:3.15
  COPY data .
  RUN cat data | md5sum > my_output
  SAVE ARTIFACT my_output AS LOCAL my_output

takes approx 7seconds on my ubuntu machine.

However when I run it under earthly-v0.6.6, it takes approx 1.3 seconds.

between v0.6.6 and v0.6.7, there was an update to buildkit: 8d4eb6b24e7253afd621f91f2589221bfd1dabdf..60bb7fe66d316609854ee3c89466df464044ef8f

and an update to fsutil:
9f87c4e70cf0..42fa88fbe869

@alexcb
Copy link
Collaborator

alexcb commented Jul 22, 2022

--use-copy-include-patterns was enabled by default in v0.6.0; which is being used in both v0.6.6 and v0.6.7.

@alexcb
Copy link
Collaborator

alexcb commented Jul 22, 2022

here's buildkit logs from v0.6.7

time="2022-07-22T19:57:57Z" level=debug msg="received shell connection" app=shellrepeater remote.addr="172.17.0.1:37026"
time="2022-07-22T19:57:57Z" level=debug msg="closing existing shell connection" app=shellrepeater remote.addr="172.17.0.1:37026"
time="2022-07-22T19:57:57Z" level=debug msg="session started"
time="2022-07-22T19:57:57Z" level=debug msg=resolving
time="2022-07-22T19:57:57Z" level=debug msg="do request" request.header.accept="application/vnd.docker.distribution.manifest.v2+json, application/vnd.docker.distribution.manifest.list.v2+json, application/vnd.oci.image.manifest.v1+json, application/vnd.oci.image.index.v1+json, */*" request.header.user-agent=containerd/1.6.0-rc.1+unknown request.method=HEAD
time="2022-07-22T19:57:58Z" level=debug msg="fetch response received" response.header.content-length=1638 response.header.content-type=application/vnd.docker.distribution.manifest.list.v2+json response.header.date="Fri, 22 Jul 2022 19:57:57 GMT" response.header.docker-content-digest="sha256:0483110c51a764abe175a8cefb941541ca0bc0f628ce418d3a660338abf4622a" response.header.docker-distribution-api-version=registry/2.0 response.header.docker-ratelimit-source=2b697055-3ede-42a8-a8ce-addb3ad1c7fb response.header.etag="\"sha256:0483110c51a764abe175a8cefb941541ca0bc0f628ce418d3a660338abf4622a\"" response.header.strict-transport-security="max-age=31536000" response.status="200 OK"
time="2022-07-22T19:57:58Z" level=debug msg=resolved desc.digest="sha256:0483110c51a764abe175a8cefb941541ca0bc0f628ce418d3a660338abf4622a"
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch
time="2022-07-22T19:57:58Z" level=debug msg=fetch

*** here is where we experience a 5 second pause ***

time="2022-07-22T19:58:03Z" level=debug msg="reusing ref for local: w0kjgax7l4t9jv6huwhbyx739" span="[context .] local context ."
time="2022-07-22T19:58:03Z" level=debug msg="diffcopy took: 1.856012ms" span="[context .] local context ."
time="2022-07-22T19:58:03Z" level=debug msg="load cache for [+hello 564660e1-200e-4c7b-ba4b-c43b5600eb3b] RUN cat data | md5sum > my_output with baavvvivnr59mbfovdync6t3n::jzc8q99dtd1p85vrrpyrusnqd"
time="2022-07-22T19:58:03Z" level=debug msg="load cache for [+hello 564660e1-200e-4c7b-ba4b-c43b5600eb3b] SAVE ARTIFACT my_output +hello/my_output AS LOCAL my_output with baavvvivnr59mbfovdync6t3n::675z10is37112wovyd7rqs5lc"
time="2022-07-22T19:58:03Z" level=debug msg="session finished: <nil>"
time="2022-07-22T19:58:03Z" level=error msg="reading from connection failed: EOF" app=shellrepeater

@alexcb
Copy link
Collaborator

alexcb commented Nov 3, 2022

This might be showing up in #2307 (comment) where failed commands each introduce a 5 second delay.

@rrjjvv
Copy link
Contributor

rrjjvv commented Nov 13, 2022

I've identified a fix for this. I'm not super confident it's a solution though; it may be just addressing a symptom of a bug elsewhere in the system. Though a one-line fix will do the trick, I thought my two-line was "better".

The --no-cache times are far from scientific of course, but the cached versions should have been fully cached (as they ran immediately after their --no-cache counterparts). The two times are the ones emitted by -V (so 'processing' time and 'real' time):

Invocation release patched
--no-cache -V +for-linux 4m17s / 1m27s 2m34s / 1m25s
- V +for-linux 1m50 / 13s 3s / 2s
--no-cache -V -P +test 48m54s / 7m19s 32m28s / 5m52s
-V -P +test 18m12s / 1m57 2m33s / 18s

The cached times are insanely different. The "cpu" times look even more dramatic, but the 'real' times are what we care about, and still very substantial (13s -> 2s for building earthly, and 2m -> 18s for running its tests).

I had always suspect (per my previous comment) that the part responsible for the delay was this delay. Changing that value is an easy way to validate that. One fix would be to decrease that, but it's not a very good solution. Removing the timeout altogether is not an option (without other changes).

My solution (which yielded those results above) was to change this small block to rrjjvv/buildkit@7bb02cd

With the addition of extra logging, I found zero occurrences of the requested session id ever actually being in the map. Since it appeared destined to always fail, it seemed like the best option. From what I could gather, that only a new session can release that Wait.

Since the pertinent code doesn't seem to deviate from upstream buildkit in any significant way, my only guesses are 1) something within earthly is artificially stopping the creation of a new session that would unblock that Wait under normal circumstances, or 2) a suboptimal/incorrect id is being used (where the 'correct' id would result in getting a client from the map and not incurring the Wait). I don't know enough about earthly or buildkit to guess which is the case, if there are other explanations, or if I'm totally off-base. What I do know is that two-line change insanely speeds up the build, and without any obvious repercussions.

The only thing stopping me from creating a PR is not fully knowing if this is a true solution or just a band-aid. (The fact that my fix is in the buildkit fork, and around code that doesn't deviate too much from upstream, is another reason I think the real fix may actually lie elsewhere.) If the latter, hopefully it's a pointer in the direction of a permanent/correct fix!

@alexcb
Copy link
Collaborator

alexcb commented Nov 15, 2022

great find @rrjjvv !

It would appear that this slowness happens in the upstream moby/buildkit source too.

When no sessionID is specified on the llb.Local call, the localSourceHandler will do a quick return at https://github.com/moby/buildkit/blob/b3e8c63a48ad8c015f5631fc1947945b229b3919/source/local/local.go#L92

	src := llb.Local("cwd")

	dest := llb.Image("alpine:latest").
		File(llb.Mkdir(workDir, os.ModeDir)).
		Dir(workDir).
		File(llb.Copy(src, "src.txt", "/copy-example/dest")).
		Run(llb.Shlex("cat /copy-example/dest")).Root()

however if I try llb.Local("cwd", llb.SessionID("my-session-id"), I end up with a blocking call that eventually times out with no active session for local-session-id: context deadline exceeded.

I will have to dig in deeper to see if there's a mistake in how we use SessionID.

@rrjjvv
Copy link
Contributor

rrjjvv commented Nov 15, 2022

As you probably saw (or will see), the noWait flag is only true at one place in the buildkit codebase: a GC routine. And that routine was the reason for adding that flag (and the current Wait implemented existed prior). That's also why I thought my solution was a band-aid (and not the real fix).

After taking another peek, my guess isn't that you're using the SessionIDs wrong per-se (but maybe you are). But more that it started being used in ways that you didn't anticipate. Prior to moby/buildkit@b3e8c63, the SessionID wasn't referenced at all for local snapshots (nor was that session->client map referenced).

Anecdotally, the "shape" of ids that are present in the map, i.e. as a result of new sessions, are quite a bit different than the ids giving us problems. The ones in the map are relatively short and all lowercase, whereas the cache-miss ids are noticeably longer and mixed case. So that does hint at incorrect usage, but just a hint.

Just glad I could help!

@alexcb
Copy link
Collaborator

alexcb commented Nov 16, 2022

I have a potential fix under #2413 which my intial repro-case in approximately 1second.

I also tried a - V +for-linux (after the cache is built up), and got comparable results:

real	0m4.876s
user	0m0.781s
sys	0m0.350s

and -V -P +test (after cache is warmed) took:

real	0m58.121s
user	0m4.881s
sys	0m1.183s

@vladaionescu
Copy link
Member Author

This is now complete & released 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:performance Performance improvement request or issue
Projects
None yet
Development

No branches or pull requests

3 participants