ci: use nested dagger for testdev #7223

Draft - jedevc wants to merge 4 commits into main from use-nested-dagger-in-dagger-for-testdev

Conversation

@jedevc (Member) commented Apr 30, 2024

Specifically: this doesn't spin up a docker container with a dev version of dagger, it spins up a dagger service with a dev version of dagger - dagger in dagger!

This simplifies deployment to the new CI runners (since docker is no longer needed).
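
Roughly, the wiring looks something like this (a simplified sketch rather than the exact code in this PR - runTestsAgainstDevEngine, the 1234 port and the golang base image are illustrative, and it assumes the dev engine container is already configured to listen on TCP):

package ci

import (
	"context"

	"dagger.io/dagger"
)

// Hypothetical helper: run the freshly built dev engine as a Dagger service
// and point a test container at it, instead of starting a docker container
// on the CI host.
func runTestsAgainstDevEngine(ctx context.Context, c *dagger.Client, devEngine *dagger.Container, src *dagger.Directory) error {
	// Assumes devEngine already serves the engine API on tcp://0.0.0.0:1234.
	// (In practice the engine service also needs privileged execution; elided here.)
	engineSvc := devEngine.
		WithExposedPort(1234).
		AsService()

	_, err := c.Container().
		From("golang:1.22"). // illustrative base image for running ./hack/make
		WithDirectory("/src", src).
		WithWorkdir("/src").
		// Bind the nested dev engine as a service...
		WithServiceBinding("dev-engine", engineSvc).
		// ...and tell the dagger CLI inside the container to use it.
		WithEnvVariable("_EXPERIMENTAL_DAGGER_RUNNER_HOST", "tcp://dev-engine:1234").
		WithExec([]string{"sh", "-c", "./hack/make engine:testimportant"}).
		Sync(ctx)
	return err
}

The point is that everything stays inside a single dagger session, so the CI host itself no longer needs docker.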


At some point, it would be nice to have the ./hack/dev script use this as well - however, I'm not 100% sure how this would fit into other core devs' workflows. Personally, I periodically run ./hack/dev to build and restart the docker engine, and use ./hack/with-dev to run commands in that context. That doesn't work as well with this setup, where it's more difficult to observe the logs of a running service, etc.

So maybe we need to keep ./hack/dev at least for now - we can discuss and maybe do this in the future though.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc requested review from gerhard and sipsma April 30, 2024 13:39
@jedevc jedevc force-pushed the use-nested-dagger-in-dagger-for-testdev branch 2 times, most recently from 781aef2 to 0d74b67 on April 30, 2024 14:38
Specifically! This doesn't spin up a docker container with a dev version
of dagger, it spins up a dagger service with a dev version of dagger -
dagger in dagger!

This simplifies deployment to the new CI runners.

Signed-off-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc force-pushed the use-nested-dagger-in-dagger-for-testdev branch from 0d74b67 to 8f01bb7 on April 30, 2024 15:03
This should allow nested networking to actually work.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@gerhard (Member) commented May 1, 2024

I finally caught up with this. I was unable to get dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" passing locally:

    multi.go:85: 7: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: in exec dagger --debug query
    multi.go:85: 8: 1: in
    multi.go:85: 8: dagger --debug query
    multi.go:85: 8: 1: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: 17:54:27 DBG frontend exporting spans=1
    multi.go:85: 8: 17:54:27 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
    multi.go:85: 8: 17:54:27 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:27 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG frontend exporting spans=1
    multi.go:85: 8: 17:54:45 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
    multi.go:85: 8: 17:54:45 DBG new end old="0001-01-01 00:00:00 +0000 UTC" new="2024-05-01 17:54:45.914647407 +0000 UTC"
    multi.go:85: 8: 17:54:45 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG frontend exporting logs logs=1
    multi.go:85: 8: 17:54:45 DBG exporting log span=5a38dfed6e7afd9a body="Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module \"test\": failed to validate type def: cannot define function with reserved name \"id\" on object \"Test\"\n\n"
    multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"

I attached all the logs as a 17MB gzip: engine.testdev.x22.2024-05-01.run1.txt.gz. The logs are 215MB uncompressed.

I ran it multiple times and it keeps failing. I am now running it on a different host to double-check.


The challenge is that each run takes more than 60mins locally (versus only 16mins in CI).

As dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" is running, this is what the machine looked like in that period:
[screenshot]

👆 The elevated iowait made me want to check on the disk:

This is what the disk reported:
[screenshots]

I can see a lot of small operations (2k/s), but since utilisation was at 45%, this is not a bottleneck.

While bandwidth was never the issue (interface is 10Gbit):
[screenshot]

I found the (growing) number of open TCP sockets interesting:
[screenshot]

FWIW:
[screenshot]

@gerhard (Member) commented May 1, 2024

I ran the same on:
[screenshot]

While the test finished in 15mins and appeared to pass (the command exited with 0), these lines in the logs make me suspicious:

Error: response from query: input: container.from.withExec.withEnvVariable.withExec.withWorkdir.withDirectory.withMountedCache.withExec.withMountedDirectory.withMountedCache.withExec.withMountedDirectory.withMountedFile.withEnvVariable.withServiceBinding.withEnvVariable.withWorkdir.withExec.sync resolve: process "sh -c ./hack/make engine:testimportant" did not complete successfully: exit code: 125

I dug a bit more and found this:

    file: failed to create temp dir: mkdir /tmp/buildkit-mount2491926873: no space left on device kind:*fmt.wrapError stack:<nil>]

I'm attaching the full logs compressed: engine.testdev.hannibal.2024-05-01.run1.txt.gz

Are you able to reproduce the same issue on your workstation @jedevc?

@gerhard (Member) commented May 1, 2024

I am re-running dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" and also capturing the system stats on the Mac. Also re-running on the Linux workstation to see if the 1h behaviour is consistent.

@jedevc (Member, Author) commented May 2, 2024

@gerhard the first error logs you shared reveal that the tests fail because of a 1hr timeout:

panic: test timed out after 1h0m0s
running tests:
	TestModuleConstructor/basic/go (2m33s)
	TestModuleConstructor/basic/python (2m12s)
	TestModuleConstructor/basic/typescript (2m31s)
	TestModuleDaggerCallArgTypes/directory_arg_inputs/local_dir/rel_path (1m38s)
	TestModuleReservedWords/id/arg/python (1m6s)
	TestModuleReservedWords/id/arg/typescript (57s)
	TestModuleReservedWords/id/field/typescript (1m6s)
	TestModuleReservedWords/id/fn/python (47s)

None of the tests here had been running for anywhere near that long - I suspect the tests are just running slowly overall. I'll attempt to run this locally to see if I can get similar results; I'm not quite sure I understand the disparity between local and CI here.
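(For context, that panic comes from go test's overall -timeout, which applies to the whole test binary rather than to any single test - so sustained slowness across many tests is enough to trip it even if nothing is individually stuck.)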


For the Mac build, I suspect the out-of-space issue you're hitting is legitimate - are you using Docker Desktop? If so, how much space is allocated to the Docker Desktop disk, and how much is in use? From past experience, the defaults for this are far too low.

Also, it's very strange to get this error while creating a temp dir under /tmp - how much RAM is allocated?

Also, the logs don't seem to be complete - they cut off at the end:

PASS core/integration.TestModuleCodegenOnDepChange/python_uses_go (142.06s)
PASS core/integration.TestModulePythonProjectLayout/hatch/main.py (46.10s)
PASS core/integration.TestMo

I can't repro an issue where the tests fail but the command returns 0.


I do notice that with this approach the size of the logs does increase - it appears that we get a lot of repetition, but I'm not entirely sure why that happens yet. Will attempt to do some investigation here.

@gerhard (Member) commented May 2, 2024

Hitting that 1h timeout is completely unexpected. This is a Linux 6.6.28 host with 16 CPUs (AMD Ryzen 7 5800X), 64GB of DDR4 & a 1TB NVMe (980 PRO). It's as close as it gets to the CI runner hosts. The only suspect I have at this point is Docker 24.0.5 on NixOS 24.05. It would be good to know what results you see on your Linux host. I will try running this again on another host.


As for the macOS host, Docker Desktop uses the following config:

  • 10 CPUs
  • 16GB of RAM
  • 64GB of virtual disk (Docker Desktop was reporting 45GB available)

I am going to bump the disk to 128GB and try again. Current status:
[screenshot]

There is a single container running:

docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED        STATUS          PORTS     NAMES
a0f0853c63dc   registry.dagger.io/engine:v0.11.1   "dagger-entrypoint.s…"   17 hours ago   Up 58 seconds             dagger-engine-63121e921843d412

FWIW, the 2nd run failed as well: engine.testdev.hannibal.2024-05-01.run2.txt.gz

Doing the 3rd run now using this exact command:

dagger call --progress=plain --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" 2>&1 | tee engine.testdev.hannibal.2024-05-02.run3.txt

@jedevc (Member, Author) commented May 2, 2024

Out of curiosity, can you successfully run main's testdev on your machine?

I guess this PR is blocked in the meantime - I also can't get it to complete in any reasonable amount of time. There are clearly some horrendous performance issues in our tests that really need a thorough investigation - none of these should be taking this long.

@gerhard (Member) commented May 2, 2024

I am waiting for the results on macOS. It should be done within 10 mins. Will try the same command from main after.

Current status:
[screenshot]

The disk is also working overtime - at least 150MB/s constant with some 400MB/s peaks:
[screenshot]

There is nothing else running on this M1 Max except the command captured above.


I've got my result:
[screenshot]

I am attaching the compressed logs: engine.testdev.hannibal.2024-05-02.run3.txt.gz (19MB compressed & 244MB plain).

From my POV, this approach is a no-go. It uses too many resources and it keeps failing in different ways across multiple machines. This is what the macOS host looked like while running this test only:
[screenshots]

I have no idea how it passes in CI 🤷‍♂️

I would like to stop digging in this direction and try something else instead. See my later comments.

@gerhard (Member) commented May 2, 2024

Out of curiosity, can you successfully run main's testdev on your machine?

Running this now on the Linux machine. Current status:

dagger-0.11.1 call --progress=plain --source=.:default --host-docker-config=file:/home/gerhard/.docker/config.json test all --race=true 2>&1 | tee engine.testrace.x22.2024-05-02.run1.txt

FTR, this is how I found my 3rd run of this on the Linux host:
[screenshot]

Hit the infamous issue again:

@gerhard (Member) commented May 2, 2024

What are your thoughts @jedevc on leaving this as is, and instead continuing to run this in Docker, as we have been so far, but on larger GitHub runners?

@jedevc (Member, Author) commented May 2, 2024

Yeah, that's fine - I'm not quite sure what specifically about this setup makes it behave so differently across different local machines.

We can leave this open, and I'll come back and pick it up later.

@gerhard (Member) commented May 2, 2024

@jedevc does the following command pass for you locally, on main?

./hack/dev
./hack/with-dev dagger call --progress=plain --source=.:default --host-docker-config=file:$HOME/.docker/config.json test important --race=true 2>&1 | tee engine.testdev.2024-05-02.main.run1.txt

20mins later, this is still running hot for me on macOS:
[screenshot]

As I finished typing that, it failed:
[screenshot]

Attaching my full output gzipped: engine.testdev.hannibal.2024-05-02.main.run1.txt.gz

@jedevc (Member, Author) commented May 2, 2024

Managed to track down why we have a huge log explosion, which is definitely not helping.

In this new setup, we have:

  • A top-level dagger client, dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant"
  • A dev dagger client, dagger call --source=. test important (called by ./hack/make engine:testimportant)
  • Individual test dagger clients (at least one per test)

Now, all of these use --progress=plain in CI, since no TUI is available. However, since dagger v0.11.1 (see #7069), the plain output also ends with a summary that prints the logs again - this essentially doubles the size of the output.

This makes the logs... tricky to read. But also, because of the layering, we essentially get an 8x multiplier on the size of our client logs, with tons of redundant info.
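
Back-of-the-envelope: with three nested clients each emitting its child's output twice (once in the live plain-progress stream, once in the trailing summary), the innermost test output can end up repeated roughly 2 × 2 × 2 = 8 times - which lines up with the blow-up in the attached logs.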

@gerhard (Member) commented May 2, 2024

That is very helpful to know @jedevc! Seems that we are getting somewhere 💪

I was able to set up a clean Ubuntu 24.04 LTS host (Linux 6.8.0-31-generic) with:

  • 8 vCPUs
  • 16GB of RAM
  • 100GB NVMe

This is effectively the Premium Intel 8 vCPUs variant - a.k.a. c-8-intel. FTR:

I installed:

And then ran the following commands:

git clone https://github.com/dagger/dagger
cd dagger
./hack/dev
./hack/with-dev dagger call --progress=plain --source=.:default --host-docker-config=file:$HOME/.docker/config.json test important --race=true 2>&1 | tee engine.testdev.2024-05-02.main.run1.txt

First run succeeded in 33mins: engine.testdev.2024-05-02.main.run1.txt.gz

This is what the system metrics looked like:
[screenshot]
👆 CPU maxed out for most of the run. Takeaway: 8 vCPUs are not enough; we need 16 vCPUs.

[screenshot]

👆 Load reached 70. Takeaway: 8 vCPUs are not enough; we need 16 vCPUs.

[screenshots]

👆 While the disk hits 150MB/s, it's not saturated - 49% utilisation.

Second run failed in 32mins: engine.testdev.2024-05-02.main.run2.txt.gz

[screenshot]

System metrics looked exactly the same, no change.

Next steps:

  • Resize to 16 vCPU & 32GB RAM
  • Re-run

@gerhard (Member) commented May 3, 2024

I re-ran this yesterday on c-16-intel twice and it failed both times within 15mins.

[screenshot]

👆 engine.testdev.2024-05-02.main.run3.txt.gz

[screenshot]

👆 engine.testdev.2024-05-02.main.run4.txt.gz

The good news is that it took half the time of c-8-intel - 15mins vs 33mins. This confirms the workload is CPU-bound.

I am running this again on a fresh instance. I will run it twice:

  1. Before this change
  2. After this change

I will capture the duration as well as the system resource usage (cpu, load & disk throughput + utilization).

If this change is within 10% of main, we are going to merge it as is & move on.

If it isn't, I am moving on to other PRs, specifically:

@gerhard (Member) commented May 3, 2024

[screenshot]

Before this change ✅ PASS in 18m 3s

engine.testdev.2024-05-03.main.run1.txt.gz

./hack/dev
time ./hack/with-dev ./hack/make engine:testimportant 2>&1 | tee engine.testdev.2024-05-03.main.run1.txt
[screenshots]

After this change ✅ PASS in 23m 29s

engine.testdev.2024-05-03.pr7223.run1.txt.gz

./hack/dev
time ./hack/with-dev dagger call --progress=plain --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" 2>&1 | tee engine.testdev.2024-05-03.pr7223.run1.txt
[screenshots]

Based on the above results, leaving this as is - good call on the draft change @jedevc 💪

@jedevc jedevc marked this pull request as draft May 3, 2024 16:07
gerhard added a commit to jedevc/dagger that referenced this pull request May 6, 2024
Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: dagger#7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this pull request May 6, 2024
* Use Dagger v0.11.2 via the new CI setup for all workflows except dev-engine

This one requires Docker with specific fixes that we don't yet have in
the new CI setup.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Setup CI for new, legacy and vertical scaling

The setup we want for production is:
- For all <LANG> SDK jobs, run them on the new CI only
- For testdev, run them on the docker-fix legacy CI
- For test/dagger-runner, run them on both legacy CI and new CI
- For all the rest, run them on the new CI and github runners for the really simple jobs

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename concurrency group

Signed-off-by: Matias Pan <matias@dagger.io>

* Install curl on production vertical scaling runner

Signed-off-by: Matias Pan <matias@dagger.io>

* Add customizable runner for separate perf tests

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename to _async_hack_make

Signed-off-by: Matias Pan <matias@dagger.io>

* Upgrade missing workflow to v0.11.1

Signed-off-by: Matias Pan <matias@dagger.io>

* Target nvme

Signed-off-by: Matias Pan <matias@dagger.io>

* CI: Default to 4CPUs & NVMe disks

Otherwise the workflows are too slow on the new CI runners and are
blocking the migration off the legacy CI runners.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Bump to v0.11.2 & capture extra details in comments

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Debug dagger-engine.dev in large GitHub Runner

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Continue running engine:testdev in dagger-runner-docker-fix runner

Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: #7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Matias Pan <matias@dagger.io>
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 8, 2024