
🌱 Improve release speed #9392

Merged
merged 1 commit into kubernetes-sigs:main on Sep 19, 2023

Conversation

cahillsf
Member

@cahillsf cahillsf commented Sep 11, 2023

What this PR does / why we need it:
During the last release cut (Slack thread for ref) we were blocked fairly early in the process waiting for the post-cluster-api-push-images job to complete. We cut two patch releases (1.4.6 and 1.5.1), each taking ~28 minutes to complete.

This PR improves the speed of the release process by:

  • splitting the original make command into 3 sub-makes to enable parallelization within the cloudbuild job
  • passing -j 32 to each sub-make to parallelize its targets, and -O to keep each job's log output grouped (see the sketch below)
  • bumping the cloudbuild machine to a larger size
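
For illustration, here is a rough sketch of the resulting invocations (not the literal cloudbuild steps; the target names come from the existing release-staging recipe, and REGISTRY is whatever staging registry the job already uses):

# each sub-make runs as its own cloudbuild step; -j 32 parallelizes the targets
# within a step, and -O groups each target's output so the interleaved logs stay readable
REGISTRY="${STAGING_REGISTRY}" make -j 32 -O docker-build-all
REGISTRY="${STAGING_REGISTRY}" make -j 32 -O docker-push-all
REGISTRY="${STAGING_REGISTRY}" make -j 32 -O release-alias-tag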

The results of my local testing have been promising; the most recent job run with the current changes completed in under 6 minutes: https://a.cl.ly/5zubWldx

Here is some historical testing data from my sandbox:

  • builds with the make command split into 3 separate targets (3 steps defined in the cloudbuild), with varying -j option values and machine sizes: https://a.cl.ly/6quw9NJX
  • local run overview from the original config (the job ran in 26 minutes, which is consistent with the results from the last patch release cut): https://a.cl.ly/4guRAqQK

/area release

@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. area/release Issues or PRs related to releasing cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 11, 2023
@mdbooth
Contributor

mdbooth commented Sep 11, 2023

There are a couple of simpler ways to achieve this using just Make. Firstly, you can define release-staging as:

.PHONY: release-staging
release-staging: ## Build and push container images to the staging bucket
        REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-build-all
        REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
        REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag

Then calling make release-staging -j8 will pass the parallel argument to the submakes, but still execute them sequentially.

Alternatively you could define docker-build-% as a dependency of docker-push-%, i.e.:

docker-push-%: docker-build-%
        $(MAKE) ARCH=$* docker-push

You can then remove a step from release-staging:

release-staging: ## Build and push container images to the staging bucket
        REGISTRY=$(STAGING_REGISTRY) $(MAKE) docker-push-all
        REGISTRY=$(STAGING_REGISTRY) $(MAKE) release-alias-tag

This will allow make to start pushing each image as soon as the relevant build completes, rather than waiting for all of the builds to finish, which is probably a marginal speed improvement.

Incidentally, I rewrote cloud-provider-openstack's cloudbuild to use buildx for multi-arch recently and was very happy with the results both in terms of performance and maintainability:

https://github.com/kubernetes/cloud-provider-openstack/blob/master/Dockerfile
https://github.com/kubernetes/cloud-provider-openstack/blob/master/cloudbuild.yaml
https://github.com/kubernetes/cloud-provider-openstack/blob/df6131eef40621b3660bb9634c8a0ed93927ab83/Makefile#L172-L178

Although note that CPO's build is somewhat more complicated than normal because we build multiple images for each architecture, and for performance we build them all in the same buildx builder.

cloud-provider-aws has a somewhat simpler multi-arch cloudbuild example:
https://github.com/kubernetes/cloud-provider-aws/blob/master/cloudbuild.yaml
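
To give a feel for the approach, a minimal buildx sketch looks roughly like this (the image name, registry, tag, and platform list are placeholders, not the CPO or cluster-api invocation):

docker buildx create --use --name multiarch-builder
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag "${REGISTRY}/example-image:${TAG}" \
  --push .

A single builder cross-builds all requested platforms and pushes one multi-arch manifest, which is what removes the per-architecture build/push loop from the Makefile.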

Contributor

@killianmuldoon killianmuldoon left a comment


Do we have any data on the impact of this on the speed? I'd prefer to only merge performance improvements where there is a significant improvement.

@mdbooth
Contributor

mdbooth commented Sep 11, 2023

Do we have any data on the impact of this on the speed? I'd prefer to only merge performance improvements where there is a significant improvement.

I believe this has the potential to improve performance by a factor of the number of build architectures, which is currently 5.

However, it will be dominated by the build stage, which is already heavily multi-process because it's Go. Running all 5 of them simultaneously on the same system is not likely to give a 5x performance benefit.

@cahillsf
Member Author

cahillsf commented Sep 11, 2023

@mdbooth thanks for the insights -- I thought I was missing something here. Cloud Build doesn't provide much telemetry on resource consumption during build jobs, but your findings are consistent with a lot of the questions I had while testing this and experimenting with the parallelism and machine size.

@killianmuldoon here is the Slack thread for reference. So far, yes: our build times from the previous release were ~28 minutes. Here are the most recent results from the builds I have been testing with (the fastest I have gotten it to run is ~6 minutes, which is roughly consistent with the 5x improvement mdbooth speculated): https://a.cl.ly/6quw9NJX

@cahillsf cahillsf closed this Sep 11, 2023
@cahillsf cahillsf reopened this Sep 11, 2023
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 11, 2023
@cahillsf
Member Author

cahillsf commented Sep 12, 2023

Here is a screenshot from the most recent run overview with the improvements suggested in the draft + a larger machine (bumped to E2_HIGHCPU_32): https://a.cl.ly/5zubWldx (the job ran in under 6 minutes)

For comparison, here is the screenshot of the run overview from the original config: https://a.cl.ly/4guRAqQK (the job ran in 26 minutes)

@killianmuldoon
Contributor

Thanks for adding the data @cahillsf! Take this out of draft once you think it's ready to fully review and I'll come back to it.

@cahillsf cahillsf marked this pull request as ready for review September 12, 2023 12:25
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2023
Contributor

@mdbooth mdbooth left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: a83d2023e475d565f960b12d55c621dc5f033b5d

@furkatgofurov7
Member

Here is a screenshot from the most recent run overview with the improvements suggested in the draft + a larger machine (bumped to E2_HIGHCPU_32): https://a.cl.ly/5zubWldx (the job ran in under 6 minutes)

For comparison, here is the screenshot of the run overview from the original config: https://a.cl.ly/4guRAqQK (the job ran in 26 minutes)

Thanks for working on it @cahillsf, this is already a huge improvement IMO.

/lgtm

@cahillsf
Member Author

👋 bumping this when you have a chance to review @killianmuldoon

Contributor

@killianmuldoon killianmuldoon left a comment


One question, but this looks generally good to me.

revert to second variant
Contributor

@killianmuldoon killianmuldoon left a comment


Thanks for this!

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 19, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 8505157bdf71b000815b83acfccda376cb83a988

Member

@furkatgofurov7 furkatgofurov7 left a comment


/lgtm

This would be nice to try out before the upcoming patch releases next week! Hopefully, we will have enough reviews from another maintainer by then 🙂

@sbueringer
Member

Thx everyone for working on this and especially on providing data!!

/approve

@cahillsf Thank you very much for your work on this!!
Can you please update the PR description to reflect the latest state of the PR?

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2023
@sbueringer
Member

@furkatgofurov7 @killianmuldoon I assume we have to cherry-pick this to get an actual improvement for 1.5.x and 1.4.x?

(fine from my side, just making sure we don't forget it)

@k8s-ci-robot k8s-ci-robot merged commit d981496 into kubernetes-sigs:main Sep 19, 2023
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.6 milestone Sep 19, 2023
@sbueringer
Member

Hm, this should be the first job running with this make target: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296
(already > 14m)

@sbueringer
Member

Hm, one failed with a timeout after 46 min. No idea if related: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296

@cahillsf
Member Author

cahillsf commented Sep 19, 2023

Looks like it was on the third sub-make -- actually, that is just an overall failure:

2023/09/19 16:46:07 Failed to run some build jobs: [error running [gcloud builds submit --verbosity info --config /home/prow/go/src/github.com/kubernetes-sigs/cluster-api/cloudbuild.yaml --substitutions _PULL_BASE_REF=main,_GIT_TAG=v20230919-v1.5.0-rc.0-472-gd9814963c --project k8s-staging-cluster-api --gcs-log-dir gs://k8s-staging-cluster-api-gcb/logs --gcs-source-staging-dir gs://k8s-staging-cluster-api-gcb/source gs://k8s-staging-cluster-api-gcb/source/badf5c11-4108-46e6-ba47-6563213bcd31.tgz]: exit status 1]

The logs don't provide much to go on, unless I am missing something.

Same error on the subsequent run as well.

Is there any way for us to view the additional context from the Cloud Build UI?

@sbueringer
Member

sbueringer commented Sep 19, 2023

There should be the cloud build log in the artifacts folders, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296/artifacts/build.log

(except if you already looked at this one)

@furkatgofurov7
Member

There is nothing except BUILD TIMEOUT -- am I blind, or am I missing something that is causing it? If not, why did it not happen when testing locally?

@furkatgofurov7
Member

There should be the cloud build log in the artifacts folders, e.g. https://storage.googleapis.com/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296/artifacts/build.log

(except if you already looked at this one)

Looks like it was on the third sub-make -- actually, that is just an overall failure:

2023/09/19 16:46:07 Failed to run some build jobs: [error running [gcloud builds submit --verbosity info --config /home/prow/go/src/github.com/kubernetes-sigs/cluster-api/cloudbuild.yaml --substitutions _PULL_BASE_REF=main,_GIT_TAG=v20230919-v1.5.0-rc.0-472-gd9814963c --project k8s-staging-cluster-api --gcs-log-dir gs://k8s-staging-cluster-api-gcb/logs --gcs-source-staging-dir gs://k8s-staging-cluster-api-gcb/source gs://k8s-staging-cluster-api-gcb/source/badf5c11-4108-46e6-ba47-6563213bcd31.tgz]: exit status 1]

The logs don't provide much to go on, unless I am missing something.

Same error on the subsequent run as well.

Is there any way for us to view the additional context from the Cloud Build UI?

Not that I am aware of, unfortunately.

@cahillsf
Member Author

cahillsf commented Sep 19, 2023

Yep, not many other indications of errors; it's just timing out.

The failed runs are not even completing all of the builds: there should be 35 calls to DOCKER_BUILDKIT=1 docker build --build-arg builder_image= (5 archs x 7 imgs), but the failed job logs only show 14 calls to this docker build.
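
As a rough way to check this against the uploaded log (assuming each docker build invocation is echoed on its own line):

curl -s https://storage.googleapis.com/kubernetes-jenkins/logs/post-cluster-api-push-images/1704163223483191296/artifacts/build.log | grep -c 'DOCKER_BUILDKIT=1 docker build'
# expected: 35 (5 archs x 7 imgs); the failed run only shows 14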

@sbueringer
Member

sbueringer commented Sep 19, 2023

Should we revert this PR to get the image pushes on main back?

Independent of that, it might be good to get some timestamps into these logs.

cahillsf added a commit to cahillsf/cluster-api that referenced this pull request Sep 19, 2023
…-release-speed"

This reverts commit d981496, reversing changes made to 2639f2c.
@cahillsf
Member Author

cahillsf commented Sep 19, 2023

Here is the PR to revert: #9465

sorry for breaking this

If not, why did it not happen when testing locally?

That is a good question; I will have to try to dig into this.

@sbueringer
Member

sbueringer commented Sep 19, 2023

sorry for breaking this

No worries :)

I'm open to further experiments; I just wanted to avoid the pressure of having to fix this ASAP while it's broken.

@furkatgofurov7
Member

sorry for breaking this

Yeah, no worries, thanks for the quick response!

@cahillsf
Member Author

Hey @sbueringer, looking into this piece:

Independent of that it might be good to get some timestamps into these logs

Trying to track down where the best place to make this modification would be; I'm thinking it might require something like piping the output of gcloud builds submit to ts. Does that sound right to you?
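
Roughly what I have in mind (ts is from moreutils and would need to be available in the image-builder container; the flags shown are just the ones already visible in the job's error output):

gcloud builds submit --config cloudbuild.yaml --project k8s-staging-cluster-api . 2>&1 | ts '[%Y-%m-%d %H:%M:%S]'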

If so, I think the modification would have to be made in the image-builder container that is used to kick off the Cloud Build job (src), but I am having trouble finding the source code for its run.sh command. Locally I can exec into the container and output the contents, but it seems to call a builder executable which I also cannot find the source code for, e.g.:

echo "Running..."
if [ -n "${ARTIFACTS}" ] && [ -z "${LOG_TO_STDOUT}" ]; then
  echo "\$ARTIFACTS is set, sending logs to ${ARTIFACTS}"
  /builder --log-dir="${ARTIFACTS}" "$@"
else
  /builder "$@"
fi

I would have assumed all of this would be in the image-builder repo, but I can't seem to find the run.sh file or the builder source code. Any ideas for the best next steps?

@mdbooth
Contributor

mdbooth commented Sep 25, 2023

The nightly cloudbuild job is defined here: https://github.com/kubernetes/test-infra/blob/00527b5641ac6d7e5b99249c8c8c6c8a4a636999/config/jobs/image-pushing/k8s-staging-cluster-api.yaml#L379-L412

It uses gcr.io/k8s-staging-test-infra/image-builder:v20230711-e33377c2b4, which is defined here: https://github.com/kubernetes/test-infra/tree/master/images/builder

@@ -3,7 +3,7 @@
 timeout: 2700s
 options:
   substitution_option: ALLOW_LOOSE
-  machineType: 'E2_HIGHCPU_8'
+  machineType: 'E2_HIGHCPU_32'
Member


Was the timeout maybe due to this machine type increase?

Maybe it was not able to get scheduled?

Do the other changes make sense without this line (+ setting 8 in line 15)?

Member Author

@cahillsf cahillsf Sep 25, 2023


Yes, that seems most likely and would also explain why the timeout did not occur in my environment. The Cloud Build queue time is visible via the GCP console but is not clear from our current logging. Having the timestamps would help us confirm that theory, but given that several runs timed out, I am not sure we could take any action to improve the situation anyway.

Do the other changes make sense without this line (+ setting 8 in line 15)?

Yes, they do, and based on my previous testing they should give us a ~10 minute improvement in job time.

Member Author


Wanted to touch back here @chrischdi -- what do you think is the best path forward? Shall I open a new PR leaving the machine type as E2_HIGHCPU_8 and passing in -j 8?

Member


I think that's a viable path forward 👍

Member


Nice catch!
