ci: use nested dagger for testdev #7223
base: main
Conversation
Signed-off-by: Justin Chadwell <me@jedevc.com>
Force-pushed from 781aef2 to 0d74b67
Specifically! This doesn't spin up a docker container with a dev version of dagger, it spins up a dagger service with a dev version of dagger - dagger in dagger! This simplifies deployment to the new CI runners. Signed-off-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
Force-pushed from 0d74b67 to 8f01bb7
This should allow nested networking to actually work. Signed-off-by: Justin Chadwell <me@jedevc.com>
I finally caught up with this. I was unable to get it to pass:

multi.go:85: 7: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
multi.go:85: 8: in exec dagger --debug query
multi.go:85: 8: 1: in
multi.go:85: 8: dagger --debug query
multi.go:85: 8: 1: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
multi.go:85: 8: 17:54:27 DBG frontend exporting spans=1
multi.go:85: 8: 17:54:27 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
multi.go:85: 8: 17:54:27 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
multi.go:85: 8: 17:54:27 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
multi.go:85: 8: 17:54:45 DBG frontend exporting spans=1
multi.go:85: 8: 17:54:45 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
multi.go:85: 8: 17:54:45 DBG new end old="0001-01-01 00:00:00 +0000 UTC" new="2024-05-01 17:54:45.914647407 +0000 UTC"
multi.go:85: 8: 17:54:45 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
multi.go:85: 8: 17:54:45 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
multi.go:85: 8: 17:54:45 DBG frontend exporting logs logs=1
multi.go:85: 8: 17:54:45 DBG exporting log span=5a38dfed6e7afd9a body="Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module \"test\": failed to validate type def: cannot define function with reserved name \"id\" on object \"Test\"\n\n"
multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"

I attached all the logs; I ran it multiple times and keep seeing it fail. I am running it on a different host to double-check. The challenge is that each run takes more than 1hr.

This is what the disk reported: I can see a lot of small operations (2k/s), but since utilisation was at 45%, this is not a bottleneck. Bandwidth was never the issue either (the interface is 10Gbit). I found the (growing) number of open TCP sockets interesting.
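For context on that validation error: dagger reserves "id" on module objects, since it is used for object identity. A minimal sketch of the kind of definition that trips the check, assuming a Go module with hypothetical names (this is not the actual test module):

package main

// Test is a hypothetical dagger module object. The Go SDK registers
// exported methods as module functions under lowerCamelCase names, so
// ID() would be exposed as "id" and fail module validation with:
// cannot define function with reserved name "id" on object "Test".
type Test struct{}

func (t *Test) ID() string {
	return "some-id"
}

// In a real dagger module the SDK generates the entrypoint; main is
// only here so the sketch compiles standalone.
func main() {}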
While the test run finished, it failed with:

Error: response from query: input: container.from.withExec.withEnvVariable.withExec.withWorkdir.withDirectory.withMountedCache.withExec.withMountedDirectory.withMountedCache.withExec.withMountedDirectory.withMountedFile.withEnvVariable.withServiceBinding.withEnvVariable.withWorkdir.withExec.sync resolve: process "sh -c ./hack/make engine:testimportant" did not complete successfully: exit code: 125

I dug a bit more and found this:

┃ file: failed to create temp dir: mkdir /tmp/buildkit-mount2491926873: no space left on device kind:*fmt.wrapError stack:<nil>]

I'm attaching the full logs compressed: engine.testdev.hannibal.2024-05-01.run1.txt.gz

Are you able to reproduce the same issue on your workstation @jedevc?
I am re-running.
@gerhard the first error logs you shared reveal that the tests fail because of a 1hr timeout:
None of the tests here have been running for that long - I suspect for some reason the tests are just running slowly. I'll attempt to run this locally to see if I can get similar results; I'm not quite sure I understand the disparity between local and CI here.

For the Mac build, I suspect the out-of-space issue you hit is legitimate - are you using Docker Desktop? If so, how much space is allocated to the Docker Desktop disk, and how much is in use? From past experience, the defaults for this are far too low. That said, it's very strange to get this error while creating a temp file in /tmp.

Also, the logs don't seem to be complete - they cut off at the end:
I can't repro an issue where the tests fail but the command returns 0. I do notice that with this approach the size of the logs does increase - it appears that we get a lot of repetition, but I'm not entirely sure why. I'll do some investigation here.
Hitting that 1hr timeout makes sense, then. As for the macOS host, Docker Desktop uses the following config:
I am going to bump the disk to 128GB and try again.

Current status: there is a single container running:

docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED        STATUS          PORTS   NAMES
a0f0853c63dc   registry.dagger.io/engine:v0.11.1   "dagger-entrypoint.s…"   17 hours ago   Up 58 seconds           dagger-engine-63121e921843d412

FWIW, the 2nd run failed as well: engine.testdev.hannibal.2024-05-01.run2.txt.gz

Doing the 3rd run now using this exact command:

dagger call --progress=plain --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" 2>&1 | tee engine.testdev.hannibal.2024-05-02.run3.txt
Out of curiosity, can you successfully run this locally?

I guess this PR is blocked in the meantime; I also can't seem to complete it in any reasonable amount of time. There are clearly some horrendous performance issues in our tests that really need a thorough investigation - none of these should be taking this long.
I am waiting for the results on macOS; it should be done within 10 mins. Will try the same command from main next.

The disk is also working overtime - at least 150MB/s constant, with some 400MB/s peaks. There is nothing else running on this M1 Max except the command captured above.

I am attaching the compressed logs: engine.testdev.hannibal.2024-05-02.run3.txt.gz

From my POV, this approach is a no-go. It uses too many resources and it keeps failing in different ways across multiple machines. This is what the macOS host looked like while running this test only: I have no idea how it passes in CI 🤷‍♂️

I would like to stop digging in this direction and try something else instead. See my later comments.
What are your thoughts @jedevc on leaving this as is, and instead continuing to run this in Docker, as we have been so far, but on larger GitHub runners?
Yeah, that's fine - I'm not quite sure what specifically in this setup makes it behave differently on different machines. We can leave this open, and I'll come back and pick it up later.
@jedevc does the following command pass for you locally, on main?

./hack/dev
./hack/with-dev dagger call --progress=plain --source=.:default --host-docker-config=file:$HOME/.docker/config.json test important --race=true 2>&1 | tee engine.testdev.2024-05-02.main.run1.txt

As I finished typing that, it failed. Attaching my full output gzipped: engine.testdev.hannibal.2024-05-02.main.run1.txt.gz
Managed to track down why we have a huge log explosion, which is definitely not helping. In this new setup, we have several layers of nested dagger clients.

Now, all of these emit their own progress output. This makes the logs... tricky to read. But also, because of the layering, we essentially get an x8 multiplier on the size of our client logs, with tons of redundant info.
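As a back-of-the-envelope illustration (my own model of it, not something measured from the code): if each nested layer prints its own copy of the progress stream in addition to relaying its child's, every layer doubles the innermost client's output, and three layers of nesting yields the x8:

package main

import "fmt"

// Toy model of nested-client log amplification: assume every layer
// re-emits its child's output in addition to printing its own copy.
func amplification(layers int) int {
	factor := 1
	for i := 0; i < layers; i++ {
		factor *= 2 // each layer doubles the innermost client's lines
	}
	return factor
}

func main() {
	fmt.Println(amplification(3)) // 8, matching the multiplier above
}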
That is very helpful to know @jedevc! Seems that we are getting somewhere 💪

I was able to set up a clean Ubuntu 24.04 instance. This is effectively the Premium Intel 8 vCPUs variant. I installed Docker, and then ran the same commands as above.
First run succeeded. This is what the system metrics looked like: load was elevated, and while the disk hit 150MB/s, it was not saturated.

Second run failed. System metrics looked exactly the same, no change.

Next steps:
I re-ran this yesterday:

engine.testdev.2024-05-02.main.run3.txt.gz
engine.testdev.2024-05-02.main.run4.txt.gz

The good news is that it took half the time of the previous runs. I am running this again on a fresh instance. I will run it twice:
I will capture the duration as well as the system resource usage (CPU, load & disk throughput + utilisation). If this change is within an acceptable range, I'll continue here.

If it isn't, I am moving on to other PRs, specifically:
Before this change ✅
Large GitHub Runners are failing consistently, not worth debugging at this point since we know this works on a vanilla Ubuntu 24.04 instance with Docker - must be an issue related to GitHub Large Runners. FTR: dagger#7223 (comment) Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
* Use Dagger v0.11.2 via the new CI setup for all workflows except dev-engine

  This one requires Docker with specific fixes that we don't yet have in the new CI setup.

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Setup CI for new, legacy and vertical scaling

  The setup we want for production is:
  - For all <LANG> SDK jobs, run them on the new CI only
  - For testdev, run them on the docker-fix legacy CI
  - For test/dagger-runner, run them on both legacy CI and new CI
  - For all the rest, run them on the new CI and github runners for the really simple jobs

  Signed-off-by: Matias Pan <matias@dagger.io>

* Rename concurrency group

  Signed-off-by: Matias Pan <matias@dagger.io>

* Install curl on production vertical scaling runner

  Signed-off-by: Matias Pan <matias@dagger.io>

* Add customizable runner for separate perf tests

  Signed-off-by: Matias Pan <matias@dagger.io>

* Rename to _async_hack_make

  Signed-off-by: Matias Pan <matias@dagger.io>

* Upgrade missing workflow to v0.11.1

  Signed-off-by: Matias Pan <matias@dagger.io>

* Target nvme

  Signed-off-by: Matias Pan <matias@dagger.io>

* CI: Default to 4CPUs & NVMe disks

  Otherwise the workflows are too slow on the new CI runners and are blocking the migration off the legacy CI runners.

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Bump to v0.11.2 & capture extra details in comments

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Debug dagger-engine.dev in large GitHub Runner

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Continue running engine:testdev in dagger-runner-docker-fix runner

  Large GitHub Runners are failing consistently, not worth debugging at this point since we know this works on a vanilla Ubuntu 24.04 instance with Docker - must be an issue related to GitHub Large Runners. FTR: #7223 (comment)

  Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Matias Pan <matias@dagger.io>
Specifically! This doesn't spin up a docker container with a dev version of dagger, it spins up a dagger service with a dev version of dagger - dagger in dagger!
This simplifies deployment to the new CI runners (since docker is no longer needed).
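For anyone unfamiliar with the pattern, here is a rough Go SDK sketch of what "dagger in dagger" means here. The port and commands are illustrative only, and the real CI code also needs engine privileges and cache volumes, which are elided:

package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Run an engine image as a dagger service (illustrative; the real
	// setup uses a dev build and needs privileged execution).
	devEngine := client.Container().
		From("registry.dagger.io/engine:v0.11.1").
		WithExposedPort(1234).
		AsService()

	// Point a nested client at that service instead of at a Docker
	// daemon - this is why docker is no longer needed on the runner.
	out, err := client.Container().
		From("alpine:3.19").
		WithServiceBinding("dev-engine", devEngine).
		WithEnvVariable("_EXPERIMENTAL_DAGGER_RUNNER_HOST", "tcp://dev-engine:1234").
		WithExec([]string{"sh", "-c", "echo tests would run here"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}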
At some point, it would be nice to have the ./hack/dev script use this as well - however, I'm not 100% sure how this would work with other core devs' workflows. Personally, I periodically run ./hack/dev to build and restart the docker engine, and use ./hack/with-dev to run commands in that context. This doesn't work as well with this setup, where it's more difficult to observe the logs in a running service, etc.

So maybe we need to keep ./hack/dev at least for now - we can discuss and maybe do this in the future though.
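For comparison, the ./hack/dev plus ./hack/with-dev flow boils down to pointing a client at the dev engine container over the Docker transport. Roughly, and only as a sketch of the idea (the actual scripts do more; dagger-engine.dev is the container name referenced by the dev workflow):

package main

import (
	"context"
	"fmt"
	"os"

	"dagger.io/dagger"
)

func main() {
	// ./hack/with-dev effectively exports something like this before
	// running a command, so the CLI/SDK talks to the dev engine
	// container instead of a released engine.
	os.Setenv("_EXPERIMENTAL_DAGGER_RUNNER_HOST", "docker-container://dagger-engine.dev")

	ctx := context.Background()
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Because the dev engine is a plain docker container, its logs stay
	// observable via docker logs - the workflow property noted above.
	out, err := client.Container().
		From("alpine:3.19").
		WithExec([]string{"echo", "hello from the dev engine"}).
		Stdout(ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}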