
Performance degradation after 0.7.2 #3085

Closed
rrjjvv opened this issue Jul 11, 2023 · 1 comment
Labels
type:bug Something isn't working

Comments

rrjjvv (Contributor) commented Jul 11, 2023

Recap of Slack conversation

What went wrong?
Original observation: when running the earthly tests (+test against main @ 8bb5634), executions that were cached took a long time to finish. On every execution there was a significant delay between "exporting outputs" and the actual completion of the build. Shortly after the "exporting outputs" message, the "Waiting on Buildkit... (Utilization: 0 other builds, 0/20 op load)" message appeared and remained until the build ended. During this delay, which is measured in minutes (and accounts for half of the total build time), CPU was noticeably and consistently elevated, and memory utilization quadrupled. The memory eventually dropped a few minutes after the build completed.

The behavior itself is 100% reproducible. However, to this point it has only been noticeable while running the earthly tests, and when "fully cached". It has not been noticeable or detrimental in my typical day-to-day usage, though I suspect the problem is still there.

Earthly's own Earthfile is the best reproducer at this point. However, in order to get an apples-to-apples comparison across versions (see below), some specific tests had to be disabled. I can provide the diff, but I don't think it's relevant.
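
For reference, this is roughly how the reproduction was run (a minimal sketch, assuming a checkout of the earthly repository at the commit mentioned above; it omits the small diff that disables a few tests, and exact targets/flags may differ in your environment):

```shell
# Check out earthly's own source at the commit under test
# (main @ 8bb5634, per the observation above).
git clone https://github.com/earthly/earthly.git
cd earthly
git checkout 8bb5634

# The first run warms the buildkit cache; the second, fully cached run
# is where the end-of-build delay shows up.
earthly +test
time earthly +test
```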

What should have happened?
The expectation is that "fully cached" builds are fast and that delays have logical explanations.

So far, the only workaround appears to be downgrading to version 0.7.2.

Other Helpful Information
After experimentation and discussion with others, I believe this is due to buildkit changes between the earthly 0.7.2 and 0.7.4 releases, and not to the earthly binary itself. My original observations called out increased memory and CPU, but I now believe that increased disk writes are the real issue (with memory/CPU being symptoms, or having a common root cause).

At one point, the test suite took only 18 seconds when fully cached; 0.6.30 was the first release to include the changes that made that possible. Therefore I started my reproduction/investigation with a 0.6.30 client and corresponding tag as a baseline, with only the buildkit version varying.

In both tables, only buildkit versions 0.7.2 and 0.7.3 are relevant, but additional versions are included to help illustrate the patterns. The last three columns are verbatim from the output of docker stats for the buildkit container. My original observations were on my corporate laptop (with lots of security software); to aid isolation/consistency, these numbers were gathered with my laptop as the client and a remote buildkit server (on the same LAN).
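
For completeness, this is a sketch of how each row was gathered (assuming earthly's default buildkit container name, earthly-buildkitd; adjust the name if yours differs):

```shell
# On the client: measure wall time for the build.
time earthly +test

# On the buildkit host, once the build finishes: grab the Net I/O,
# Block (disk) I/O, and PID columns for the buildkit container.
docker stats --no-stream \
  --format "table {{.Name}}\t{{.NetIO}}\t{{.BlockIO}}\t{{.PIDs}}" \
  earthly-buildkitd
```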

Stats for "fresh" builds (empty buildkit cache, but fully populated registry cache):

| Client/Code version | Buildkit version | (Wall) time | Net I/O | Disk I/O | PIDs |
| --- | --- | --- | --- | --- | --- |
| 0.6.30 | 0.6.30 | 8m33s | 2.44GB / 184MB | 0B / 14.1GB | 35 |
| 0.6.30 | 0.7.0 | 8m44s | 2.44GB / 184MB | 8.19kB / 14.3GB | 28 |
| 0.6.30 | 0.7.2 | 8m59s | 2.44GB / 186MB | 0B / 14.2GB | 31 |
| 0.6.30 | 0.7.3 | 9m10s | 2.44GB / 185MB | 0B / 15GB | 21 |
| 0.6.30 | 0.7.4 | 9m11s | 2.44GB / 186MB | 0B / 15.4GB | 23 |
| 0.6.30 | 0.7.5 | 9m19s | 2.44GB / 185MB | 0B / 15.1GB | 22 |
| 0.6.30 | 0.7.9 | 8m59s | 2.44GB / 184MB | 221kB / 15.2GB | 24 |

The main takeaway is that there was a consistent increase of almost 1GB in disk writes. There was an increase in wall time, though possibly not a statistically significant one. And notably, there was no obvious delay at the end of the build (no "Waiting on Buildkit").

Stats for "fully cached" builds (fully populated buildkit cache, fully populated registry cache). Note that some tests had RUN --no-cache, so the baseline is still somewhat elevated:

| Client/Code version | Buildkit version | (Wall) time | Net I/O | Disk I/O | PIDs |
| --- | --- | --- | --- | --- | --- |
| 0.6.30 | 0.6.30 | 2m14s | 348MB / 161MB | 0B / 402MB | 25 |
| 0.6.30 | 0.7.0 | 2m24s | 348MB / 160MB | 369kB / 399MB | 26 |
| 0.6.30 | 0.7.2 | 2m24s | 348MB / 161MB | 0B / 404MB | 25 |
| 0.6.30 | 0.7.3 | 3m31s | 348MB / 161MB | 0B / 3.34GB | 33 |
| 0.6.30 | 0.7.4 | 3m37s | 348MB / 160MB | 0B / 3.64GB | 32 |
| 0.6.30 | 0.7.5 | 3m25s | 348MB / 161MB | 0B / 3.3GB | 32 |
| 0.6.30 | 0.7.9 | 3m20s | 348MB / 160MB | 0B / 3.34GB | 35 |

The increase in disk writes is extremely obvious. The increase in wall time is significant as well, and it feels worse to a human because every run ends with "Waiting on Buildkit". The PID counts are noticeably and consistently elevated, but may or may not be relevant. In both the "fresh" and "cached" scenarios the network I/O was extremely consistent, implying that neither the build context nor the exports are factors.

After shifting focus to disk writes, some other things of interest were found. One very specific system call (pwrite64) stood out: an strace of a fully-cached build with 0.7.2 showed 78238 calls, whereas a fully-cached 0.7.4 build showed 467255. Just as significantly, the majority of those 467k writes occurred during the noticeable lag at the end of the build. Additionally, the predominant file descriptor being written during that time corresponds to /tmp/earthly/buildkit/runc-overlayfs/containerdmeta.db.
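
For anyone who wants to reproduce the syscall comparison, this is roughly how the pwrite64 counts and the target file can be identified (a sketch; the buildkitd process lookup and the output path are assumptions, and strace must run on the buildkit host):

```shell
# Attach to the running buildkitd, following threads/children (-f),
# decoding file descriptors into paths (-y), and recording only
# pwrite64 calls.
strace -f -y -e trace=pwrite64 -o /tmp/pwrite.trace -p "$(pidof buildkitd)"

# ...run the fully cached build from the client, then stop strace...

# Total pwrite64 calls observed during the build.
grep -c 'pwrite64(' /tmp/pwrite.trace

# Which files received the most writes (containerdmeta.db dominates here).
grep -o 'pwrite64([0-9]*<[^>]*>' /tmp/pwrite.trace \
  | sort | uniq -c | sort -rn | head
```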

I have asciinema casts (to illustrate the lags that a human sees) and strace dumps, but those are likely too large to attach here. Hopefully there's enough here to go on (and hopefully reproducible by others).

rrjjvv added the type:bug label Jul 11, 2023
dchw added a commit that referenced this issue Jul 26, 2023
This tests, and implements, earthly/buildkit#20 as a partial performance resolution to #3085. Some of the higher resource utilization at the end still exists (per user reports testing this fix), but the pauses appear to be gone.
vladaionescu (Member) commented:

I believe that this has since been fixed.
