Refactor GitHub actions #11183

abbysmal · 2022-04-11T20:39:17Z

Refactoring and cleaning up GitHub Actions

This PR is a follow-up to #10980.
The Multicore merge greatly altered the GitHub Actions workflow on ocaml/ocaml.
Multicore OCaml's CI setup diverged greatly from the ocaml/ocaml one on multiple occasions during its lifespan.

The goals of this PR are manifold:

- Clean up and refactor the current workflow: the new test matrices introduced by Multicore are messy and hard to navigate.
- Bring back features and test scenarios from the 4.X branch (namely the full-flambda run, as well as the other-checks runner pass).
- Comment the current test matrix as introduced by Multicore OCaml, and discuss whether or not these additions to ocaml/ocaml are worthwhile.

This PR itself is a bit lengthy: a lot of meaning was lost in the history tree, and it ended up being squashed.

I think the best way to review it would be to compare this tree to the 4.14 equivalent. @dra27 may be interested in this PR.

Cleanup and refactor

This PR introduces a few simplifications:

Removal of the super testsuite run: I have never been convinced of the usefulness of this run, and the code added to runner.sh specifically to support this use case is quite ugly.
Historically, this run was introduced to run the parallel testsuite multiple times and catch possibly intermittent failures. In practice I do not think this is valuable: if such failures exist, they will eventually be caught in another run on either CI system.
TestLoop was simplified to TestPrefix in runner.sh. TestLoop is not needed in its current form since super is gone. TestPrefix also opts to run tests in parallel (to follow suit with Test).
build.yml now has four main jobs:
- normal: does a full testsuite run, builds the doc, attempts to make install, runs other-checks, and checks for changes in the manual.
- others:
  - macos: compiles OCaml and runs the full testsuite on macOS.
  - linux-O0: compiles OCaml with CFLAGS='-O0' and runs a selection of test directories.
- extra:
  - debug: does a full run of the testsuite with the debug runtime.
  - debug-s4096: runs a selection of test directories with the debug runtime and a minor heap of 4096 words.
- build: the normal, debug and debug-s4096 "jobs" all share the same compiler build.
  This build is compiled during the initial build job. The freshly built directory is then uploaded as a build artifact to be reused by the aforementioned jobs.

To be noted: the macos and linux-O0 sub-jobs cannot reuse the compiler built during the build job, because they either run on a different OS or rely on different configure parameters when building the OCaml distribution.
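
The build-once, reuse-everywhere idea described above can be sketched in shell. This is a minimal illustration under assumptions, not the workflow's actual code: the function names `pack_build` and `unpack_build` and all paths are hypothetical, and moving the tarball between runners is left to the CI system's artifact upload and download steps.

```shell
# Hypothetical helpers; names and paths are illustrative only.

# pack_build <build-dir> <tarball>: archive a freshly built tree.
# The tarball must live outside <build-dir>.
pack_build() {
  tar -C "$1" -czf "$2" .
}

# unpack_build <tarball> <dest-dir>: restore the tree in a later job,
# then delete the tarball so it does not linger as a stray file.
unpack_build() {
  mkdir -p "$2" &&
  tar -C "$2" -xzf "$1" &&
  rm -f "$1"
}
```

Deleting the tarball after extraction mirrors the "Remove artefact tarball after extraction" commit in this PR, which notes that a leftover tarball interferes with the clean-working-directory checks.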

Bring back the features lost after the merge

The normal job is a mirror copy of what the GHA runs do on the 4.14 branch.
One major difference from previous Multicore runs: as for 4.X, the "main" testsuite run is now a full flambda run.
The same applies to every subsequent job reusing the compiler artifact built during the build job.
One omission: I did not reimport the i386-static run into the workflow. Is it something desirable?

Discussions

The current draft is an attempt at merging what used to be done in trunk before the merge and the various tweaks introduced by the Multicore team.
Both debug runs proved especially useful to the team when developing the Multicore runtime.
We tried to strike a balance in running time by only running a select few directories where adequate (those that historically proved useful for us to insist on).

I think the removal of the super job (which aimed to re-run select directories three times without further adjustments) is fine and spares us from burning more CPU cycles.

Two runs I think are worth discussing:

- taskset-c0 (which was left out of this PR) was used to catch some synchronization issues within the Multicore runtime. It does not involve a full build of the compiler (as it can reuse the cached build of the normal testsuite run), but proves to be problematic in some instances: parallel/pingpong.ml, for instance, is very slow in this run (which, looking at the testcase, makes sense to me). Do we want to bring it back?
- linux-O0 (which was left in this PR) caught issues in the Multicore runtime a few times, exposing problems that appeared when optimizations were disabled. I could not trace back such scenarios. It involves a full, separate compiler build, as well as running some parts of the testsuite (a full run would be prohibitively slow).

One more thing worth pondering: the GitHub Actions runner does not use OCAML_TEST_SIZE to provide hints to the runners on the number of cores available. Should this be added as well?

dra27 · 2022-04-24T16:22:19Z

Thanks for grasping this nettle, @Engil! Extra commits pushed to:

- Fix the "Check for manual changes" step
- Restore the test for building the manual (this is temporarily using my fork of lambdasoup, but I think that the benefit of having the manual CI running while there are quite so many manual chapters being added/revised is probably worth it)
- Tweak make check_all_arches and the extraction commands to ensure the other-checks step passes

A review of the other parts to follow 🙂

dra27 · 2022-05-10T14:55:55Z

The test for undefined Makefile variables from "Fix undefined variable warning for $( ) in Makefiles" (#10270) needs restoring.

abbysmal · 2022-06-07T10:00:50Z

@dra27 thank you for your additions, I will review the changes soon. :)
Is there anything we are missing to push this PR through?

dra27 · 2022-07-22T07:45:18Z

As the four most recent commits testify, we really ought to get this merged 🙈

@Engil - the notes I'd had were:

- i386-static job: there's either "Test 32-bit build in GitHub Actions" (#11143) or the work you've been doing for arm32 GitHub Actions testing.
- I agree with the removal of taskset-c0. I think we should consider removing linux-O0 from the GHA pipeline - but this might usefully be run in other-configs on Jenkins? My rationale for suggesting that is that these are useful data-points… we have checked large numbers of these logs in the past to spot the intermittent failures and then used that knowledge to guide fixes, so it’s not totally wasted energy use. However, it’s not necessarily useful feedback on a PR.
- We should utilise OCAML_TEST_SIZE, yes.

However, this PR is most definitely a step-wise improvement over the current pipelines so I suggest that if you're happy with the commits I've pushed and as I'm happy with your part of the PR that we go ahead and merge and iterate in another PR?

abbysmal · 2022-07-22T08:05:31Z

Your latest additions LGTM, I think this is a long overdue improvement and we need to build upon this.

dra27 · 2022-07-22T08:07:16Z

Ta - I'll merge it when CI returns (🤞) and cherry-pick to 5.0

Octachron · 2022-07-22T09:40:22Z

Am I reading correctly that the macOS job is running without flambda, and thus we have at least one job without flambda enabled?

dra27 · 2022-07-22T10:27:45Z

Yes, that's correct - the Windows build is also done without flambda (as on 4.14)

Octachron · 2022-07-22T10:51:56Z

Thanks for the confirmation!

xavierleroy · 2022-07-22T18:15:50Z

I'd rather have most CI jobs without flambda and one job with flambda, if only because non-flambda builds take less time.

> We should utilise OCAML_TEST_SIZE, yes.

If OCAML_TEST_SIZE is not set, the test suite tries to scale down to 2-core virtual machines, which is what GHA offers. So, just leaving OCAML_TEST_SIZE unset should work fine.

xavierleroy · 2022-07-22T18:19:03Z

> I think we should consider removing linux-O0 from the GHA pipeline - but this might usefully be run in other-configs on Jenkins?

It should be easy to add a -O0 case to the other-configs Jenkins job. I'm not completely convinced it's going to find bugs that other builds didn't spot, but I can go along.

dra27 · 2022-07-22T19:13:47Z

Yes, we have lost a little bit of time because the debug-runtime testsuite is now being run with an flambda compiler. We could remove the -O0 build and do a non-flambda + debug runtime? The build time of this PR isn't a typical comparison, because it touches the manual (PRs which don't touch the manual save ~10 minutes by skipping the build and test of it).

@ctk21 has been a fan of the -O0 testsuite, although I expect that over time we should expect diminishing returns from running it 🙂

xavierleroy · 2022-07-25T08:38:45Z

Speaking of the manual build: I'm now getting failures with the "normal" job when it tries to build the manual, see e.g. https://github.com/ocaml/ocaml/runs/7495836889?check_suite_focus=true#step:8:10 . Can you please fix this, or disable the building of the manual?

dra27 · 2022-07-25T10:24:50Z

@xavierleroy - I'm 99% certain that what happened here was that the PR was merged before the full pipeline had finished, as the timestamp on the job start for normal and both the testsuite runs is after the PR's merging. That meant that when the "normal" job tried to fetch the additional commits needed to analyse the repo, the git fetch failed because the PR's "magic ref" ceases to exist after it's been merged.

I'll open a PR to move that analysis to the "build" job and then have it communicate with output variables, but I don't think that testing the manual's build is itself fundamentally broken.
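
As a rough sketch of the "output variables" idea: on current GitHub Actions runners, a step can publish a value by appending a `name=value` line to the file named by `$GITHUB_OUTPUT`, and a later job can read it once the producing job declares it as a job output. The helper name `set_output` and the output name `manual_changed` are hypothetical, chosen for illustration only.

```shell
# Hypothetical helper: publish a step output.
# Assumes GITHUB_OUTPUT points at a writable file, as on current runners.
set_output() {
  printf '%s=%s\n' "$1" "$2" >> "$GITHUB_OUTPUT"
}

# Illustrative use in the build job: record whether the manual was touched.
# set_output manual_changed true
```

Doing the analysis once in the "build" job means the later jobs no longer need to fetch extra commits themselves.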

xavierleroy · 2022-07-27T08:19:56Z

> I'm 99% certain that what happened here was that the PR was merged before the full pipeline had finished, as the timestamp on the job start for normal and both the testsuite runs is after the PR's merging. That meant that when the "normal" job tried to fetch the additional commits needed to analyse the repo, the git fetch failed because the PR's "magic ref" ceases to exist after it's been merged.

Would it be possible to just skip the test in this case, rather than marking it as a failure? It's not that something wrong was found in the manual, just that the test could not be run because of circumstances.

Whatever solution you choose, I would appreciate no longer receiving "PR run failed: Build" e-mails from Github, which I have to investigate every time, just to see that the problem is with the test of the manual. Thank you for your understanding.
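
One possible shape for "skip rather than fail", sketched in shell. The helper name `fetch_or_skip` and the variable `PR_REF` are hypothetical, and the command passed to the helper stands in for whatever fetch the job actually performs; nothing here is taken from the workflow.

```shell
# Hypothetical helper: run the given fetch command and report whether the
# dependent check should run or be skipped. It never fails itself.
fetch_or_skip() {
  if "$@" >/dev/null 2>&1; then
    echo run
  else
    echo skip
  fi
}

# Illustrative use (PR_REF is an assumed variable holding the PR's ref):
# if [ "$(fetch_or_skip git fetch origin "$PR_REF")" = run ]; then
#   ... check the manual ...
# fi
```

With this shape, a fetch that fails because the ref has disappeared leads to the manual check being skipped, not to a failed build.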

abbysmal assigned dra27 Apr 11, 2022

abbysmal added the no-change-entry-needed label Apr 11, 2022

dra27 mentioned this pull request Apr 24, 2022

Limit bit-rot in the disabled backends #11217

Merged

dra27 force-pushed the refactor_github_actions branch from 1b38305 to e54bd38 Compare July 22, 2022 06:21

abbysmal and others added 13 commits July 22, 2022 08:15

actions: reuse build artifacts for the debug runtime run as well

c8fc8c0

testsuite: remove taskset-c0

f27def7

Fix workflow yaml

0357145

Reset git credential configuration before upload

430b226

The credential store won't have been configured when the artefact is checked out, so disable the header override before uploading it.

Revert "Disable manual build temporarily"

6f7fff7

This reverts commit bbbe9e5.

Log the result of "Check for manual changes"

055023e

Especially now that it's in a separate build stage.

Update manual's build dependencies

e73538e

Temporarily use dra27 lambdasoup

4573f04

Remove artefact tarball after extraction

f917226

Interferes with the clean working directory other checks.

Fix riscv64 backend

ad99f06

Fix two tests under debug runtime

194ade2

Add missing pattern to make -C ocamldoc clean

8494719

Update flambda reference for for backtrace_dynlink

4cf63f0

dra27 force-pushed the refactor_github_actions branch from e54bd38 to 4cf63f0 Compare July 22, 2022 07:41

dra27 merged commit 3ad1567 into ocaml:trunk Jul 22, 2022

dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 22, 2022

Merge pull request ocaml#11183 from Engil/refactor_github_actions

62ad584

Refactor GitHub actions (cherry picked from commit 3ad1567)

abbysmal mentioned this pull request Jul 25, 2022

GitHub Actions / ocamltest / testsuite / OCaml 5 #10980

Open

16 tasks

abbysmal mentioned this pull request Aug 2, 2022

Github Actions: Move manual changes logic earlier in the build process #11470

Merged
