ci: use nested dagger for testdev #7223

Draft - jedevc wants to merge 4 commits into main from use-nested-dagger-in-dagger-for-testdev

Conversation

@jedevc (Member) commented Apr 30, 2024

Specifically: this doesn't spin up a docker container with a dev version of dagger, it spins up a dagger service with a dev version of dagger - dagger in dagger!

This simplifies deployment to the new CI runners (since docker is no longer needed).
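
Roughly, the wiring looks something like this (a simplified sketch rather than the exact code in this PR - runTestsAgainstDevEngine, the 1234 port and the golang base image are illustrative, and it assumes the dev engine container is already configured to listen on TCP):

package ci

import (
	"context"

	"dagger.io/dagger"
)

// Hypothetical helper: run the freshly built dev engine as a Dagger service
// and point a test container at it, instead of starting a docker container
// on the CI host.
func runTestsAgainstDevEngine(ctx context.Context, c *dagger.Client, devEngine *dagger.Container, src *dagger.Directory) error {
	// Assumes devEngine already serves the engine API on tcp://0.0.0.0:1234.
	// (In practice the engine service also needs privileged execution; elided here.)
	engineSvc := devEngine.
		WithExposedPort(1234).
		AsService()

	_, err := c.Container().
		From("golang:1.22"). // illustrative base image for running ./hack/make
		WithDirectory("/src", src).
		WithWorkdir("/src").
		// Bind the nested dev engine as a service...
		WithServiceBinding("dev-engine", engineSvc).
		// ...and tell the dagger CLI inside the container to use it.
		WithEnvVariable("_EXPERIMENTAL_DAGGER_RUNNER_HOST", "tcp://dev-engine:1234").
		WithExec([]string{"sh", "-c", "./hack/make engine:testimportant"}).
		Sync(ctx)
	return err
}

The point is that everything stays inside a single dagger session, so the CI host itself no longer needs docker.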


At some point, it would be nice to have the ./hack/dev script use this as well - however, I'm not 100% sure how this would fit into other core devs' workflows. Personally, I periodically run ./hack/dev to build and restart the docker engine, and use ./hack/with-dev to run commands in that context. That doesn't work as well with this setup, where it's more difficult to observe the logs of a running service, etc.

So maybe we need to keep ./hack/dev at least for now - we can discuss and maybe do this in the future though.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc requested review from gerhard and sipsma April 30, 2024 13:39
@jedevc jedevc force-pushed the use-nested-dagger-in-dagger-for-testdev branch 2 times, most recently from 781aef2 to 0d74b67 on April 30, 2024 14:38
Specifically! This doesn't spin up a docker container with a dev version
of dagger, it spins up a dagger service with a dev version of dagger -
dagger in dagger!

This simplifies deployment to the new CI runners.

Signed-off-by: Justin Chadwell <me@jedevc.com>
Signed-off-by: Justin Chadwell <me@jedevc.com>
@jedevc jedevc force-pushed the use-nested-dagger-in-dagger-for-testdev branch from 0d74b67 to 8f01bb7 on April 30, 2024 15:03
This should allow nested networking to actually work.

Signed-off-by: Justin Chadwell <me@jedevc.com>
@gerhard (Member) commented May 1, 2024

I finally caught up with this. I was unable to get dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" passing locally:

    multi.go:85: 7: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: in exec dagger --debug query
    multi.go:85: 8: 1: in
    multi.go:85: 8: dagger --debug query
    multi.go:85: 8: 1: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: 17:54:27 DBG frontend exporting spans=1
    multi.go:85: 8: 17:54:27 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
    multi.go:85: 8: 17:54:27 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:27 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG frontend exporting spans=1
    multi.go:85: 8: 17:54:45 DBG frontend exporting span trace=4b05bbee182cccc87ab152b6bff37243 id=5a38dfed6e7afd9a parent=04b938ceab5c844b span="dagger --debug query"
    multi.go:85: 8: 17:54:45 DBG new end old="0001-01-01 00:00:00 +0000 UTC" new="2024-05-01 17:54:45.914647407 +0000 UTC"
    multi.go:85: 8: 17:54:45 DBG recording span span="dagger --debug query" id=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG recording span child span="dagger --debug query" parent=04b938ceab5c844b child=5a38dfed6e7afd9a
    multi.go:85: 8: 17:54:45 DBG frontend exporting logs logs=1
    multi.go:85: 8: 17:54:45 DBG exporting log span=5a38dfed6e7afd9a body="Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module \"test\": failed to validate type def: cannot define function with reserved name \"id\" on object \"Test\"\n\n"
    multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"
    multi.go:85: 8: Error: failed to serve module: input: module.withSource.initialize resolve: failed to initialize module: failed to add object to module "test": failed to validate type def: cannot define function with reserved name "id" on object "Test"

I attached all the logs as a 17MB gzip: engine.testdev.x22.2024-05-01.run1.txt.gz. The logs are 215MB uncompressed.

I ran it multiple times and it keeps failing. I am now running it on a different host to double-check.


The challenge is that each run takes more than 60mins locally (versus only 16mins in CI).

As dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" is running, this is what the machine looked like in that period:
[screenshot]

👆 The elevated iowait made me want to check on the disk:

This is what the disk reported:
[screenshots]

I can see a lot of small operations (2k/s), but since utilisation was at 45%, this is not a bottleneck.

While bandwidth was never the issue (interface is 10Gbit):
[screenshot]

I found the (growing) number of open TCP sockets interesting:
[screenshot]

FWIW:
[screenshot]

@gerhard (Member) commented May 1, 2024

I ran the same on:
[screenshot]

While the test finished in 15mins and appeared to pass (the command exited with 0), these lines in the logs make me suspicious:

Error: response from query: input: container.from.withExec.withEnvVariable.withExec.withWorkdir.withDirectory.withMountedCache.withExec.withMountedDirectory.withMountedCache.withExec.withMountedDirectory.withMountedFile.withEnvVariable.withServiceBinding.withEnvVariable.withWorkdir.withExec.sync resolve: process "sh -c ./hack/make engine:testimportant" did not complete successfully: exit code: 125

I dug a bit more and found this:

    file: failed to create temp dir: mkdir /tmp/buildkit-mount2491926873: no space left on device kind:*fmt.wrapError stack:<nil>]

I'm attaching the full logs compressed: engine.testdev.hannibal.2024-05-01.run1.txt.gz

Are you able to reproduce the same issue on your workstation @jedevc?

@gerhard (Member) commented May 1, 2024

I am re-running dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" and also capturing the system stats on the Mac. Also re-running on the Linux workstation to see if the 1h behaviour is consistent.

@jedevc (Member, Author) commented May 2, 2024

@gerhard the first error logs you shared reveal that the tests fail because of a 1hr timeout:

panic: test timed out after 1h0m0s
running tests:
	TestModuleConstructor/basic/go (2m33s)
	TestModuleConstructor/basic/python (2m12s)
	TestModuleConstructor/basic/typescript (2m31s)
	TestModuleDaggerCallArgTypes/directory_arg_inputs/local_dir/rel_path (1m38s)
	TestModuleReservedWords/id/arg/python (1m6s)
	TestModuleReservedWords/id/arg/typescript (57s)
	TestModuleReservedWords/id/field/typescript (1m6s)
	TestModuleReservedWords/id/fn/python (47s)

None of the tests here had been running for anywhere near that long - I suspect the tests are just running slowly overall. I'll attempt to run this locally to see if I can get similar results; I'm not quite sure I understand the disparity between local and CI here.
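(For context, that panic comes from go test's overall -timeout, which applies to the whole test binary rather than to any single test - so sustained slowness across many tests is enough to trip it even if nothing is individually stuck.)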


For the Mac build, I suspect the out-of-space issue you're hitting is legitimate - are you using Docker Desktop? If so, how much space is allocated to the Docker Desktop disk, and how much is in use? From past experience, the defaults for this are far too low.

Also, it's very strange to get this error while creating a temp dir under /tmp - how much RAM is allocated?

Also, the logs don't seem to be complete - they cut off at the end:

PASS core/integration.TestModuleCodegenOnDepChange/python_uses_go (142.06s)
PASS core/integration.TestModulePythonProjectLayout/hatch/main.py (46.10s)
PASS core/integration.TestMo

I can't repro an issue where the tests fail but the command returns 0.


I do notice that with this approach the size of the logs does increase - it appears that we get a lot of repetition, but I'm not entirely sure why that happens yet. Will attempt to do some investigation here.

@gerhard (Member) commented May 2, 2024

Hitting that 1h timeout is completely unexpected. This is a Linux 6.6.28 host with 16 CPUs (AMD Ryzen 7 5800X), 64GB of DDR4 & a 1TB NVMe (980 PRO). It's as close as it gets to the CI runner hosts. The only suspect I have at this point is Docker 24.0.5 on NixOS 24.05. It would be good to know what results you see on your Linux host. I will try running this again on another host.


As for the macOS host, Docker Desktop uses the following config:

  • 10 CPUs
  • 16GB of RAM
  • 64GB of virtual disk (Docker Desktop was reporting 45GB available)

I am going to bump the disk to 128GB and try again. Current status:
[screenshot]

There is a single container running:

docker ps
CONTAINER ID   IMAGE                               COMMAND                  CREATED        STATUS          PORTS     NAMES
a0f0853c63dc   registry.dagger.io/engine:v0.11.1   "dagger-entrypoint.s…"   17 hours ago   Up 58 seconds             dagger-engine-63121e921843d412

FWIW, the 2nd run failed as well: engine.testdev.hannibal.2024-05-01.run2.txt.gz

Doing the 3rd run now using this exact command:

dagger call --progress=plain --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" 2>&1 | tee engine.testdev.hannibal.2024-05-02.run3.txt

@jedevc (Member, Author) commented May 2, 2024

Out of curiosity, can you successfully run main's testdev on your machine?

I guess this PR is blocked in the meantime - I also can't get it to complete in any reasonable amount of time. There are clearly some horrendous performance issues in our tests that really need a thorough investigation - none of these should be taking this long.

@gerhard (Member) commented May 2, 2024

I am waiting for the results on macOS. It should be done within 10 mins. Will try the same command from main after.

Current status:
[screenshot]

The disk is also working overtime - at least 150MB/s constant with some 400MB/s peaks:
[screenshot]

There is nothing else running on this M1 Max except the command captured above.


I've got my result:
[screenshot]

I am attaching the compressed logs: engine.testdev.hannibal.2024-05-02.run3.txt.gz (19MB compressed & 244MB plain).

From my POV, this approach is a no-go. It uses too many resources and it keeps failing in different ways across multiple machines. This is what the macOS host looked like while running this test only:
[screenshots]

I have no idea how it passes in CI 🤷‍♂️

I would like to stop digging in this direction and try something else instead. See my later comments.

@gerhard (Member) commented May 2, 2024

Out of curiosity, can you successfully run main's testdev on your machine?

Running this now on the Linux machine. Current status:

dagger-0.11.1 call --progress=plain --source=.:default --host-docker-config=file:/home/gerhard/.docker/config.json test all --race=true 2>&1 | tee engine.testrace.x22.2024-05-02.run1.txt

FTR, this is how I found my 3rd run of this on the Linux host:
[screenshot]

Hit the infamous issue again:

@gerhard (Member) commented May 2, 2024

What are your thoughts @jedevc on leaving this as is, and instead continuing to run this in Docker, as we have been so far, but on larger GitHub runners?

@jedevc (Member, Author) commented May 2, 2024

Yeah, that's fine - I'm not quite sure what specifically about this setup makes it behave so differently across different local machines.

We can leave this open, and I'll come back and pick it up later.

@gerhard (Member) commented May 2, 2024

@jedevc does the following command pass for you locally, on main?

./hack/dev
./hack/with-dev dagger call --progress=plain --source=.:default --host-docker-config=file:$HOME/.docker/config.json test important --race=true 2>&1 | tee engine.testdev.2024-05-02.main.run1.txt

20mins later, this is still running hot for me on macOS:
[screenshot]

As I finished typing that, it failed:
[screenshot]

Attaching my full output gzipped: engine.testdev.hannibal.2024-05-02.main.run1.txt.gz

@jedevc (Member, Author) commented May 2, 2024

Managed to track down why we have a huge log explosion, which is definitely not helping.

In this new setup, we have:

  • A top-level dagger client, dagger call --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant"
  • A dev dagger client, dagger call --source=. test important (called by ./hack/make engine:testimportant)
  • Individual test dagger clients (at least one per test)

Now, all of these use --progress=plain in CI, since no TUI is available. However, since dagger v0.11.1 (see #7069), the plain output also ends with a summary that prints the logs again - this essentially doubles the size of the output.

This makes the logs... tricky to read. But also, because of the layering, we essentially get an 8x multiplier on the size of our client logs, with tons of redundant info.
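
Back-of-the-envelope: with three nested clients each emitting its child's output twice (once in the live plain-progress stream, once in the trailing summary), the innermost test output can end up repeated roughly 2 × 2 × 2 = 8 times - which lines up with the blow-up in the attached logs.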

@gerhard (Member) commented May 2, 2024

That is very helpful to know @jedevc! Seems that we are getting somewhere 💪

I was able to set up a clean Ubuntu 24.04 LTS host (Linux 6.8.0-31-generic) with:

  • 8 vCPUs
  • 16GB of RAM
  • 100GB NVMe

This is effectively the Premium Intel 8 vCPUs variant - a.k.a. c-8-intel. FTR:

I installed:

And then ran the following commands:

git clone https://github.com/dagger/dagger
cd dagger
./hack/dev
./hack/with-dev dagger call --progress=plain --source=.:default --host-docker-config=file:$HOME/.docker/config.json test important --race=true 2>&1 | tee engine.testdev.2024-05-02.main.run1.txt

First run succeeded in 33mins: engine.testdev.2024-05-02.main.run1.txt.gz

This is what the system metrics looked like:
[screenshot]
👆 CPU maxed out for most of the run. Takeaway: 8 vCPUs are not enough; we need 16 vCPUs.

[screenshot]

👆 Load reached 70. Takeaway: 8 vCPUs are not enough; we need 16 vCPUs.

[screenshots]

👆 While the disk hits 150MB/s, it's not saturated - 49% utilisation.

Second run failed in 32mins: engine.testdev.2024-05-02.main.run2.txt.gz

[screenshot]

System metrics looked exactly the same, no change.

Next steps:

  • Resize to 16 vCPU & 32GB RAM
  • Re-run

@gerhard (Member) commented May 3, 2024

I re-ran this yesterday on c-16-intel twice and it failed both times within 15mins.

[screenshot]

👆 engine.testdev.2024-05-02.main.run3.txt.gz

[screenshot]

👆 engine.testdev.2024-05-02.main.run4.txt.gz

The good news is that it took half the time of c-8-intel - 15mins vs 33mins. This confirms the workload is CPU-bound.

I am running this again on a fresh instance. I will run it twice:

  1. Before this change
  2. After this change

I will capture the duration as well as the system resource usage (cpu, load & disk throughput + utilization).

If this change is within 10% of main, we are going to merge it as is & move on.

If it isn't, I am moving on to other PRs, specifically:

@gerhard (Member) commented May 3, 2024

[screenshot]

Before this change ✅ PASS in 18m 3s

engine.testdev.2024-05-03.main.run1.txt.gz

./hack/dev
time ./hack/with-dev ./hack/make engine:testimportant 2>&1 | tee engine.testdev.2024-05-03.main.run1.txt
[screenshots]

After this change ✅ PASS in 23m 29s

engine.testdev.2024-05-03.pr7223.run1.txt.gz

./hack/dev
time ./hack/with-dev dagger call --progress=plain --source=. dev --target=. with-exec --args "sh,-c,./hack/make engine:testimportant" 2>&1 | tee engine.testdev.2024-05-03.pr7223.run1.txt
[screenshots]

Based on the above results, leaving this as is - good call on the draft change @jedevc 💪

@jedevc jedevc marked this pull request as draft May 3, 2024 16:07
gerhard added a commit to jedevc/dagger that referenced this pull request May 6, 2024
Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: dagger#7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this pull request May 6, 2024
* Use Dagger v0.11.2 via the new CI setup for all workflows except dev-engine

This one requires Docker with specific fixes that we don't yet have in
the new CI setup.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Setup CI for new, legacy and vertical scaling

The setup we want for production is:
- For all <LANG> SDK jobs, run them on the new CI only
- For testdev, run them on the docker-fix legacy CI
- For test/dagger-runner, run them on both legacy CI and new CI
- For all the rest, run them on the new CI and github runners for the really simple jobs

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename concurrency group

Signed-off-by: Matias Pan <matias@dagger.io>

* Install curl on production vertical scaling runner

Signed-off-by: Matias Pan <matias@dagger.io>

* Add customizable runner for separate perf tests

Signed-off-by: Matias Pan <matias@dagger.io>

* Rename to _async_hack_make

Signed-off-by: Matias Pan <matias@dagger.io>

* Upgrade missing workflow to v0.11.1

Signed-off-by: Matias Pan <matias@dagger.io>

* Target nvme

Signed-off-by: Matias Pan <matias@dagger.io>

* CI: Default to 4CPUs & NVMe disks

Otherwise the workflows are too slow on the new CI runners and are
blocking the migration off the legacy CI runners.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Bump to v0.11.2 & capture extra details in comments

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Debug dagger-engine.dev in large GitHub Runner

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* Continue running engine:testdev in dagger-runner-docker-fix runner

Large GitHub Runners are failing consistently, not worth debugging at
this point since we know this works on a vanilla Ubuntu 24.04 instance
with Docker - must be an issue related to GitHub Large Runners.

FTR: #7223 (comment)

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Matias Pan <matias@dagger.io>
Co-authored-by: Gerhard Lazu <gerhard@dagger.io>
Co-authored-by: Matias Pan <matias@dagger.io>
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 8, 2024