panic: failed to get edge #1016
Thanks for the bug report @jamesalucas. Your setup should work fine - buildkit is able to take parallel requests, even if the jobs are the same. I'll see if there's anything obvious and if not, I'll report upstream.
I tried running multiple instances of earthly v0.5.13 against a few targets and unfortunately wasn't able to replicate it. @jamesalucas, to help us replicate the issue:
Hello, the project isn't hosted publicly, but I have created a slimmed-down version with a few bits removed which hopefully will still help. The main differences between this and the full version are:
Our CI has 2 jobs for each package, with 3 Earthly invocations between them:
Invocation 2) Build a Docker image
Invocation 3) Run e2e tests
I've attached it to this comment. We are (ab)using secrets to get proxy info to the commands that need it, as I couldn't find a way to get global args to work as nicely as secrets did. Hope that helps, thanks for the help!
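For reference, the secrets workaround looks roughly like this - the secret names and the +build target are placeholders, since the real ones are in the attached project (v0.5-era syntax, where the Earthfile side would consume them via RUN --secret HTTP_PROXY=+secrets/http_proxy ...):

```bash
# Illustrative only: pass proxy settings to the build as Earthly secrets.
# Secret names and the +build target are made up for this sketch.
earthly \
  --secret http_proxy="$HTTP_PROXY" \
  --secret https_proxy="$HTTPS_PROXY" \
  +build
```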
We've had a few more of these kinds of errors. Not sure if this helps, but the trace appears to be slightly different to the previous one, stating that it
Thanks for the code and the extra details, very helpful! I believe earthly/buildkit#42 will fix the panic. Of the three steps you mentioned, is there a step that you think is causing more failures than the others? Or is it too hard to tell, because the parallel pipelines are all throwing work at buildkit at the same time?
Most of the failures seem to happen in step one, where we run Earthly for the first invocation. The Gitlab runner was running RHEL7, with Earthly running inside an Alpine Docker image. I've created a second runner based off RHEL8 and upgraded to Earthly 0.5.15, and whilst we've only done a few builds, we haven't seen the panic issue yet, but got this error instead:
I will continue to monitor for the panic issue, but at the moment I'm just trying different things to get it stable. Is it useful to keep adding these onto this issue? If not, I won't! Thanks
Hi @vladaionescu, I wonder if I can get some further thoughts on this... We're now using Earthly as part of a Gitlab pipeline that has just under 100 separate jobs, with each job containing between 1 and 2 Earthly invocations - so probably 150 invocations per run of the pipeline. We have a fairly large private Gitlab runner with 20 CPUs, 80GB RAM and 500GB of disk space (as we want to leverage layer caching as much as possible), and the runner is set to run up to 10 jobs concurrently.

Most of the time it's working well, but maybe 10% of jobs fail with what appear to be recoverable errors. This has resulted in me wrapping Earthly in a shell script which greps the logs for recoverable errors and retries the same target if one occurs (a sketch of such a wrapper follows below). The list of errors I've had so far and it retries upon is:
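A minimal sketch of such a retry wrapper - the grep patterns here are placeholders (the panic from this issue's title plus one generic gRPC error), not the full list:

```bash
#!/usr/bin/env bash
# Sketch of a retry wrapper around earthly. Substitute the grep patterns
# with the recoverable errors seen in your own logs.
set -uo pipefail

log=$(mktemp)
for attempt in 1 2 3; do
  # pipefail makes the if-condition reflect earthly's exit code, not tee's
  if earthly "$@" 2>&1 | tee "$log"; then
    exit 0
  fi
  if ! grep -Eq 'failed to get edge|transport is closing' "$log"; then
    exit 1  # not a known-recoverable error; fail immediately
  fi
  echo "recoverable buildkit error; retrying (attempt $attempt of 3)" >&2
done
exit 1
```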
I am not a Go coder, but looking at the stack traces it looks as if buildkit is often performing cache garbage collection when it crashes, even though there is still 325GB free of the space that has been allocated to the cache. Is this expected? I would like to see if disabling cache garbage collection improves stability, but can't work out how to do this. I've tried setting
Whilst it does make it into the config, the result is multiple errors. Do you think this is worth trying, and if so, is it possible to disable cache gc via Earthly? If you have any suggestions as to how to investigate this further or improve stability, I would love to try something out! Thanks!
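For what it's worth, one way this is sometimes attempted (a sketch, not a verified fix): Earthly's buildkit_additional_config option appends raw TOML to the generated buildkitd.toml, and buildkit's worker config has gc/gckeepstorage settings - with the caveat that the appended TOML may duplicate sections Earthly already generates, an assumption worth verifying:

```bash
# Sketch only: write an Earthly config that passes extra buildkitd.toml
# settings through buildkit_additional_config. Assumes no existing
# ~/.earthly/config.yml to merge with, and that [worker.oci] gc /
# gckeepstorage behave as documented for buildkitd.
mkdir -p ~/.earthly
cat > ~/.earthly/config.yml <<'EOF'
global:
  buildkit_additional_config: |
    [worker.oci]
      gc = false
      # alternatively, keep GC on but make it effectively never trigger:
      # gckeepstorage = 450000  # units vary by buildkit version; check docs
EOF
```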
Hi @jamesalucas,
There seem to be various error types in there, each of which could be caused by different interactions. It is possible that one job encounters one type of error, and then the rest of the jobs that may be taking place in parallel get various other types of errors. But I can't be sure, given that this seems to only reproduce at scale. From what I can tell, the following could be root causes:
While the rest could be secondary reactions to these causes. The reasons why these things take place can vary a lot. One way to "turn off" GCing could be to set it to a very large size. But given that these errors vary so much, I suspect that's not necessarily the root cause. Here are some possible reasons that come to mind:
Overall, I think in order to be able to address some of these, we need a way to consistently reproduce the errors. If you find a minimal setup in which you can reproduce any of these with some degree of consistency, that would be really, really helpful for us to investigate further. If not, the alternative is for us to wait for these to be fixed upstream in BuildKit eventually - but it's unknown when/if these will be addressed, given that they are difficult to reproduce.

Zooming out, however, perhaps something to look into might be a way for BuildKit to recover on its own when these things occur. I think some of the secondary failures that you're seeing are caused by BuildKit crashing and canceling everything else, which is undesirable. This is something for us to possibly work out with the BuildKit team.

Separately, it might be helpful, in case you wanted that, to start and manage buildkit independently from Earthly. In the latest release, we have new Docker images to start buildkit on its own and possibly put it on a restart loop (in case it crashes). This should help with some of the trampling that might take place due to multiple threads attempting to start buildkit at the same time. It wouldn't fix all the issues you listed, but it might make a few of them occur more rarely. @dchw could give you a hand on our community Slack if you need any help in this direction.
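A sketch of what that standalone setup might look like - the image name, TCP-transport env var, port, and cache mount path are assumptions taken from Earthly's remote-buildkit docs, so verify all of them against the docs for your release:

```bash
# Run Earthly's standalone buildkitd under Docker's restart policy as the
# "restart loop" mentioned above.
docker run -d \
  --name earthly-buildkitd \
  --privileged \
  --restart=always \
  -e BUILDKIT_TCP_TRANSPORT_ENABLED=true \
  -v earthly-cache:/tmp/earthly:rw \
  -p 8372:8372 \
  earthly/buildkitd

# Then point each CI job at the shared daemon instead of letting every
# earthly invocation try to start its own:
earthly --buildkit-host tcp://127.0.0.1:8372 +build
```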
One more thought re:
Thank you for the detailed reply; I'll try to answer each point you raised...
We're using v0.5.17. I was browsing moby/buildkit over the weekend and noticed a few commits which look to address these kinds of issues... Do you think they will get merged into earthly/buildkit in the near future? If I get some time I might have a go at building an image with them included myself, to see if it helps things.
We're not using that yet, but I intend to introduce it soon to see if it helps with our CI build times. I have spent some time trying to reproduce this outside of our CI, but so far I have been unable to. If I succeed I will be sure to post back here.
This is really interesting, so I will look into it - not only for this issue, but I had also been wondering about this for caching... Is it possible to create Earthly buildkitd pods in a Kubernetes cluster and share them all via a service? If I did this, would I be able to share the cache volume between them all, or is there an assumption that one buildkitd process has exclusive use of the cache volume? My thinking is that instead of having multiple Gitlab CI runners, each with their own buildkit process and associated cache, each runner could just point to the central K8s cluster and the cache would be shared. A bit like a Bazel build farm? If I get round to this I will hop onto Slack. Thank you!
We should have those merged in pretty soon - we generally keep very close to buildkit's main branch. @alexcb might help with a merge & release this coming week.
We are planning something like this as a feature, but it'll take a while until we have it working as nicely as you describe. The cache cannot be shared between buildkit instances as it is, unfortunately. Similar to the Docker cache, it was designed to be used by only one process at a time.
We have updated earthly's version of buildkit to include the latest upstream changes, under https://github.com/earthly/earthly/releases/tag/v0.5.18. Let's see if this resolves the panics.
Thanks very much! I upgraded yesterday and I am seeing fewer crashes, but do still intermittently see
Perhaps this has always been the case and I had not noticed, but I'm wondering whether buildkit is (or has been) running a scheduled job and could crash whilst doing so - and this may or may not happen whilst a build is in progress?
This doesn't appear to be happening after the recent buildkit merges, thanks!
I've recently upgraded to Earthly 0.5.13 and our CI jobs that run multiple Earthly targets in parallel have started failing with errors such as this:
Similar logs appear in the job that is in progress at the same time...
Should running jobs in parallel on the same machine be OK? They are both run in Gitlab via Docker executors, but on the same Docker host, so the buildkitd instance is shared.
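For illustration, the access pattern boils down to something like this - the target names are made up, not from the real repo; each CI job effectively does the equivalent against the same shared daemon:

```bash
# Two Earthly invocations racing against the one shared buildkitd on this
# Docker host (illustrative targets only).
earthly +build &
earthly +docker &
wait
```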