Improve GH actions release process #3186

Open
JonasKunz opened this issue Jun 13, 2023 · 7 comments

Comments

@JonasKunz
Contributor

JonasKunz commented Jun 13, 2023

Version 1.39.0 was the first attempt at releasing via GH actions. It uncovered a few problems.

1st attempt:

  • There was a bug in the job which calls buildkite to perform the release, which meant that mvn deploy was not triggered.
  • The release preparation step had already committed the version bumps and the new tag.
  • Resolution:
    • Fix the GH-action via a PR
    • Revert the version bump commits via a PR
    • Manual deletion of the tag (required admin permissions!)
    • Retry in 2nd attempt.

2nd attempt:

  • The Deploy to maven central step "failed", although the artifact was in fact published to maven central correctly. We think that the final "publish" of the release returned a strange response / timed out, but actually went through.
  • As a result, all of the follow-up tasks were aborted due to the "failed" maven central publish step.
  • Resolution:
    • We added boolean inputs to skip the prepare release and Deploy to maven central steps
    • This made everything work except for the AWS-lambda upload, which depends on the lambda-zip uploaded by the buildkite build. Subsequently, the creation of the GH release failed as well.
    • Those had to be done manually (required admin permissions and AWS credentials).

The GH actions release workflow was designed with atomicity of the individual steps in mind, so that in theory a failure could be mitigated by just using the Rerun failed jobs feature. Unfortunately, this does not account for the case where a job reports a failure but has actually succeeded (e.g. our maven deploy), which is where things went wrong.
In addition, the Rerun failed jobs feature does not work correctly if there is a bug in the workflow or a related script. To my knowledge, Rerun failed jobs always uses the same revision, so no fixes to the scripts can be applied.

We'll most likely need to split the process up into individual workflows, which ideally are idempotent.
In particular, we should adjust the lambda-task so that it does not depend on a previously uploaded build artifact, but uses the apm-agent from maven central instead.
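
A minimal sketch of what an idempotent step could look like, assuming each step can expose a check for its own end state. The names run_idempotent, is_done and action are hypothetical and not part of the existing workflow. The idea is to trust the observable end state rather than the exit status, which would cover the "failed but actually went through" case we hit with the maven deploy:

```python
from typing import Callable


def run_idempotent(name: str,
                   is_done: Callable[[], bool],
                   action: Callable[[], None]) -> str:
    """Run `action` unless `is_done` already holds; re-check `is_done` on failure.

    Returns "skipped", "ran", or "recovered" (the action raised, but the
    desired end state was reached anyway).
    """
    if is_done():
        # A previous attempt already got us here, nothing to do.
        return "skipped"
    try:
        action()
    except Exception:
        # The step reported a failure; check the end state before giving up.
        if is_done():
            return "recovered"
        raise
    if not is_done():
        raise RuntimeError(f"step {name!r} finished without reaching its end state")
    return "ran"
```

With something like this, re-running the whole workflow after a partial failure would skip the steps that are already done instead of failing on them.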

@SylvainJuge
Member

Note to my future self so my present self can forget the details:

  • We have to delegate some tasks to buildkite for artifacts signature
  • We should probably split the release workflow into independent workflows
  • Ideas for splitting:
    • first part creates the tag and pushes the changes to main, and should end by triggering the artifact publication with buildkite
    • second part executes in buildkite
    • third part waits for the publication in maven central (or, better, is triggered by the buildkite job when it completes)
    • if the buildkite part fails, we should be able to investigate it, then trigger the third part manually if needed

@jackshirazi
Contributor

What we really need is to be able to run this in a test mode: a test branch, and a test run in buildkite using that branch, against a test maven account or mock site, etc.

@jackshirazi
Contributor

If buildkite returns failure, but maven central is still processing and will eventually succeed, is there any way we can find out the state of the maven central processing? Previously we could see it pending because we were logged in doing it, but now we could assume that it failed, try to re-run it, and then find that maven succeeds later. What would happen in that scenario?

@JonasKunz
Contributor Author

JonasKunz commented Jun 15, 2023

but now we could assume that it failed and try to re-run it and then find that maven succeeds later - what would happen in that scenario

I would expect the re-run to fail, because maven does not allow publishing the same artifact with the same version twice.

@jackshirazi
Contributor

I would expect the re-run to fail, because maven does not allow publishing the same artifact with the same version twice.

In which case the buildkite job can be made entirely idempotent by first checking whether the artifacts are available in maven. If they are, skip the rest of the job. If they are not, just run the job, because even if an update is still in progress in sonatype, the subsequent buildkite attempt should fail. I worry that sonatype is not really robust enough to take this kind of risk, though.
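
A rough sketch of that availability check, assuming the artifact is looked up through the public repo1.maven.org mirror and the standard Maven repository layout. The function names and the injectable fetch_status parameter are hypothetical, and the coordinates in the usage comment are my assumption of the agent's coordinates:

```python
import urllib.error
import urllib.request
from typing import Callable

# Public Maven Central mirror; assumed here as the place to check.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def pom_url(group_id: str, artifact_id: str, version: str,
            base: str = MAVEN_CENTRAL) -> str:
    """Build the POM URL using the standard Maven repository layout."""
    group_path = group_id.replace(".", "/")
    return f"{base}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.pom"


def http_status(url: str) -> int:
    """Return the HTTP status of a HEAD request (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


def is_published(group_id: str, artifact_id: str, version: str,
                 fetch_status: Callable[[str], int] = http_status) -> bool:
    """True if the POM is downloadable, False on 404, error on anything else."""
    status = fetch_status(pom_url(group_id, artifact_id, version))
    if status == 200:
        return True
    if status == 404:
        return False
    raise RuntimeError(f"unexpected HTTP status {status}")


# Hypothetical usage (coordinates are an assumption):
#   if is_published("co.elastic.apm", "elastic-apm-agent", "1.39.0"):
#       skip the deploy step
```

Note that a 404 here cannot tell "never published" apart from "publication still in progress", which is exactly the risk mentioned above.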

@SylvainJuge
Member

I think the buildkite part of the job should actually do two things:

  • build the artifact and trigger publication to maven central.
  • wait for the artifact to be published, with a timeout.

When everything goes well, the job completes when the artifact is published.
When things go wrong, either in the release or the "wait for publication" part, we should investigate the failure manually:

  • if it's a timeout on the publication because maven central is slow, then we can wait, but the "trigger publication" step does not need to run again.
  • if it's something else, it will likely require manual intervention.
  • in both cases, the part that follows the artifact publication will need to be triggered manually.
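
The "wait for publication, with a timeout" part could be sketched roughly as below. This is only an illustration: wait_for and its parameters are hypothetical names, and check stands for whatever query tells us the artifact is published. The clock and sleep parameters are injectable so the loop can be exercised without actually waiting:

```python
import time
from typing import Callable


def wait_for(check: Callable[[], bool],
             timeout_s: float,
             interval_s: float = 30.0,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or `timeout_s` seconds have elapsed.

    Returns True if `check` succeeded, False if the timeout was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            # Timed out: report it and leave the investigation to a human.
            return False
        # Never sleep past the deadline.
        sleep(min(interval_s, remaining))
```

A False result would only mean "not visible yet within the timeout", so it should lead to manual investigation rather than an automatic re-trigger of the publication.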

To try to make things idempotent, we could maybe use some dedicated moving git branches, like the stable one, to encode the current "state" of the release process:

  • v1.2.3 tag is created by the mvn release:prepare
  • v1.2.3-artifact branch is set to the same release commit v1.2.3 when the artifact is built and pending publication
  • when published, we can just query the state of the public maven repository, so no further tag or branch is needed.
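
To illustrate, the state could be derived along these lines. This is a sketch under assumptions: ReleaseState and release_state are hypothetical names, and the tag set, branch set and published flag are assumed to be collected beforehand (for example from git and from the public maven repository):

```python
from enum import Enum
from typing import Set


class ReleaseState(Enum):
    NOT_STARTED = "not started"
    TAGGED = "tagged"
    ARTIFACT_PENDING = "artifact built, pending publication"
    PUBLISHED = "published"


def release_state(version: str, tags: Set[str], branches: Set[str],
                  published: bool) -> ReleaseState:
    """Derive the release state from git refs and the maven repository.

    The most advanced state wins, so a published artifact is reported as
    PUBLISHED even if the marker branch was never cleaned up.
    """
    if published:
        return ReleaseState.PUBLISHED
    if f"v{version}-artifact" in branches:
        return ReleaseState.ARTIFACT_PENDING
    if f"v{version}" in tags:
        return ReleaseState.TAGGED
    return ReleaseState.NOT_STARTED
```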

I don't like this approach as it sounds brittle, and we don't have a way to keep the built and signed artifact binaries.
Maybe an intermediate repository where we push the binaries before maven central (or a copy of the build workspace just before publishing) could serve as a way to indicate "we built the artifacts, but they aren't published yet".

Another aspect that might be worth investigating: if we make the agent produce reproducible artifacts (doc), then building the release artifacts twice and attempting to publish them more than once might not be an issue.

@jackshirazi
Contributor

Having an intermediate location for binaries which is not overwritable is a great idea! That decouples the most painful part of the process from the fragility.
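
As a rough sketch of the "not overwritable" property, assuming for illustration that the intermediate location is just a directory (the real storage would likely be something else). store_once is a hypothetical name. It relies on Python's exclusive-creation file mode, so an existing binary is never replaced, while storing identical bytes again is accepted as a no-op:

```python
import hashlib
from pathlib import Path


def store_once(directory: Path, name: str, data: bytes) -> str:
    """Store `data` under `name` without ever overwriting an existing file.

    Returns the SHA-256 hex digest of the stored content. Storing the same
    bytes again is a no-op; storing different bytes raises ValueError.
    """
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / name
    digest = hashlib.sha256(data).hexdigest()
    try:
        # Mode "x" fails with FileExistsError instead of truncating the file.
        with open(target, "xb") as handle:
            handle.write(data)
    except FileExistsError:
        existing = hashlib.sha256(target.read_bytes()).hexdigest()
        if existing != digest:
            raise ValueError(f"{name} already stored with different content")
    return digest
```

If the builds were reproducible, a second build of the same release would produce the same digest and pass this check, so a retry would be harmless.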


4 participants