Improve GH actions release process #3186

JonasKunz · 2023-06-13T12:19:29Z

Version 1.39.0 was the first attempt at releasing via GH Actions. It uncovered a few problems.

1st attempt:

A bug in the job that calls Buildkite to perform the release meant that mvn deploy was never triggered.
By that point, the release preparation step had already committed the version bumps and the new tag.
Resolution:
- Fix the GH-action via a PR
- Revert the version bump commits via a PR
- Manual deletion of the tag (required admin permissions!)
- Retry in 2nd attempt.

2nd attempt:

The Deploy to maven central step "failed", but the artifact was in fact published to maven central correctly. We think that the final "publish" of the release returned a strange response / timed out, but actually went through.
As a result, all of the follow-up tasks were aborted due to the "failed" maven central publish step.
Resolution:
- We added boolean inputs to skip the prepare release and Deploy to maven central steps
- This made everything work except for the AWS-lambda upload, as it depends on the uploaded lambda-zip from the buildkite build. As a consequence, the creation of the GH release failed as well.
- Those had to be done manually (required admin permissions and AWS credentials).

The GH actions release workflow was designed with atomicity of the individual steps in mind, so that in theory a failure could be mitigated by just using the Rerun failed jobs feature. Unfortunately, this does not account for the case where a job reports a failure but has actually succeeded (e.g. our maven deploy), which is where things went wrong.
In addition, Rerun failed jobs does not work correctly if there is a bug in the workflow or a related script. To my knowledge, Rerun failed jobs always uses the same revision, so no fixes to the scripts can be applied.

We'll most likely need to split the process up into individual workflows, which ideally are idempotent.
In particular, we should adjust the lambda-task so that it does not depend on a previously uploaded build artifact, but uses the apm-agent from maven central instead.
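
As a rough illustration of fetching the agent from maven central instead of a build artifact, the download location can be derived from the artifact coordinates using the standard repository layout. This is only a sketch: the function name `maven_central_url` is hypothetical, and the coordinates in the usage note are an assumption, not taken from the workflow.

```python
# Base URL of the Maven Central repository (standard layout:
# <group path>/<artifactId>/<version>/<artifactId>-<version>.<extension>).
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_central_url(group_id, artifact_id, version, extension="jar"):
    """Build the download URL of an artifact on Maven Central.

    Hypothetical helper for illustration; not part of the release scripts.
    """
    # Dots in the groupId become directory separators in the repository path.
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}.{extension}"
    return f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/{file_name}"
```

For example, `maven_central_url("co.elastic.apm", "elastic-apm-agent", "1.39.0")` (coordinates assumed) yields a URL the lambda-task could download from, independent of any buildkite run.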


SylvainJuge · 2023-06-14T14:00:31Z

Note to my future self so my present self can forget the details:

We have to delegate some tasks to buildkite for artifact signing.
We should probably split the release workflow into independent workflows.
Ideas for splitting:
- first part that creates the tag and pushes change to main, should end with triggering the artifact publication with buildkite
- second part executes in buildkite
- third part waiting on the publication in maven central (or better triggered by buildkite job when it completes)
- if the buildkite part fails, then we should be able to investigate it, then trigger the third part manually if needed

jackshirazi · 2023-06-14T14:37:43Z

What we really need is to be able to run this in a test mode: a test branch, and a test run in buildkite using that branch, against a test maven account or mock site, etc.

jackshirazi · 2023-06-14T15:16:45Z

If buildkite returns failure, but maven central is still processing and will eventually succeed, is there any way we can tell about the maven central processing? Previously we could see it pending because we were logged in doing it, but now we could assume that it failed and try to re-run it and then find that maven succeeds later - what would happen in that scenario?

JonasKunz · 2023-06-15T08:09:05Z

> but now we could assume that it failed and try to re-run it and then find that maven succeeds later - what would happen in that scenario

I would expect the re-run to fail, because maven does not allow publishing the same artifact with the same version twice.

jackshirazi · 2023-06-15T09:41:40Z

> I would expect the re-run to fail, because maven does not allow publishing the same artifact with the same version twice.

In which case the buildkite job can be made entirely idempotent by first checking whether the artifacts are available in maven. If they are, we skip the rest of the job. If they are not, can we just run the job, since even if an ongoing update is happening in sonatype, the subsequent buildkite attempt should fail? I worry that sonatype is not really robust enough to take this kind of risk though.
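
A minimal sketch of that "check first, then publish" guard, assuming the artifact URL is known and that a plain HTTP HEAD request is enough to tell whether the artifact is available. The names `is_published` and `publish_if_missing` are hypothetical, and the `publish` callback stands in for whatever the buildkite job actually runs.

```python
import urllib.error
import urllib.request


def is_published(url, opener=urllib.request.urlopen):
    """Return True if the artifact URL answers a HEAD request with 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=30) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 404:
            # Not found: the artifact has not been published (yet).
            return False
        # Any other HTTP error is unexpected; surface it for investigation.
        raise


def publish_if_missing(url, publish, opener=urllib.request.urlopen):
    """Run publish() only when the artifact is not yet available."""
    if is_published(url, opener):
        return "skipped"
    publish()
    return "published"
```

Note that this check cannot see a publication that is still being processed, which is exactly the risk raised above: in that window the guard would start a second publish attempt.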

SylvainJuge · 2023-06-15T12:01:58Z

I think the buildkite part of the job should actually do two things:

- build the artifact and trigger publication to maven central.
- wait for the artifact to be published, with a timeout.

When everything goes well, the job completes when the artifact is published.
When things go wrong, either in the release or the "wait for publication" part, we should investigate the failure manually:

- if it's a timeout on the publication because maven central is slow, then we can wait, but the "trigger publication" does not need to be triggered again.
- if it's something else, it will likely require manual intervention.
- in both cases, the part that follows the artifact publication will need to be triggered manually.
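
The "wait for publication, with a timeout" step could look roughly like the polling loop below. This is a sketch under assumptions: `wait_for_publication` is a hypothetical name, the default timeout and poll interval are arbitrary, and `is_published` is any zero-argument callable that reports whether the artifact is visible on maven central.

```python
import time


def wait_for_publication(is_published, timeout_s=3600.0, poll_interval_s=60.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Poll is_published() until it returns True or the timeout elapses.

    Returns True when the artifact became visible, False on timeout.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_published():
            return True
        if clock() >= deadline:
            # Timed out: the caller should investigate manually rather than
            # triggering the publication again.
            return False
        sleep(poll_interval_s)
```

A `False` result only says that the artifact was not visible in time; it does not distinguish a slow maven central from a failed publication, so the manual investigation described above still applies.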

In order to attempt to make things idempotent, we could maybe use some dedicated moving git branches (like the stable one) to encode the current "state" of the release process:

- v1.2.3 tag is created by the mvn release:prepare
- v1.2.3-artifact branch is set to the same release commit v1.2.3 when the artifact is built and pending publication
- when published, we can just query the state of the public maven repository, thus no more tag is needed.
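
To make the encoding concrete, the release state could be read back from those three observations as sketched here. The enum and function names are hypothetical, and how the tag, the branch and the maven repository are actually queried is left out on purpose.

```python
from enum import Enum


class ReleaseState(Enum):
    NOT_STARTED = "not started"
    TAGGED = "tag created, artifact not built"
    ARTIFACT_PENDING = "artifact built, pending publication"
    PUBLISHED = "published to the public maven repository"


def release_state(tag_exists, artifact_branch_exists, published):
    """Derive the release state from the tag, the -artifact branch and maven.

    The most advanced observation wins, so a published artifact is reported
    as PUBLISHED regardless of the git refs.
    """
    if published:
        return ReleaseState.PUBLISHED
    if artifact_branch_exists:
        return ReleaseState.ARTIFACT_PENDING
    if tag_exists:
        return ReleaseState.TAGGED
    return ReleaseState.NOT_STARTED
```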

I don't like this approach as it sounds brittle, and we don't have a way to keep the built and signed artifact binaries.
Maybe having an intermediate repository where we push the binaries before maven central (or a copy of the build workspace just before publishing) could be used as a way to indicate "we built the artifacts, but they aren't published yet".

Also another aspect that might be worth investigating is that if we make the agent produce reproducible artifacts (doc), then building the release artifacts twice and attempting to publish them more than once might not be an issue.

jackshirazi · 2023-06-16T14:59:16Z

Having an intermediate location for binaries which is not overwritable is a great idea! That decouples the most painful part of the process from the fragility.

JonasKunz added the 8.10-candidate label Jun 13, 2023

github-actions bot added the agent-java label Jun 13, 2023

AlexanderWert mentioned this issue Jun 26, 2023

Migrate release from Jenkins to GitHub Actions #3005

Closed

AlexanderWert added this to the 8.10 milestone Jun 26, 2023

AlexanderWert removed the 8.10-candidate label Jun 26, 2023

JonasKunz mentioned this issue Jul 20, 2023

Decouple lambda layer release from buildkite build #3251

Merged

2 tasks

AlexanderWert removed this from the 8.10 milestone Jul 24, 2023
