Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor GitHub actions #11183

Merged
merged 13 commits into from Jul 22, 2022
Merged

Refactor GitHub actions #11183

merged 13 commits into from Jul 22, 2022

Conversation

abbysmal
Copy link
Contributor

Refactoring and cleaning up Github Actions

This PR is a follow up to #10980
The Multicore merge greatly altered the workflow of Github Actions on ocaml/ocaml.
Multicore OCaml diverged greatly from it on multiple occasion during its lifespan.

This PR goals are manyfold:

  • Clean up and refactor the current workflow: the new test matrices introduced by Multicore are messy and hard to navigate.
  • Bring back features and test scenarios from the 4.X branch (namely the full-flambda run, as well as the other-checks runner pass.)
  • Comment the current test matrix as introduced by Multicore OCaml, and discuss whether or not these addition to ocaml/ocaml are worthwhile.

This PR itself is a bit lengthy, a lost of meaning was lost in the history tree, and it ended up being squashed.

I think the better way to review it would be to compare this tree to the 4.14 equivalent. @dra27 may be interested by this PR.

Cleanup and refactor

This PR introduces a few simplification:

  • Removal of the super testsuite run: I have never been convinced of the usefulness of this run and the code added to runner.sh specifically to support this usecase is quite ugly.
    Historically, this run was introduced to run multiple time the parallel testsuite and catch possibly intermittent failures, in practice I do not think this is valuable as if such failures exists they will be catched eventually in another run on either CI system.
  • TestLoop was simplified to TestPrefix in runner.sh. TestLoop is not needed in its current form since super is gone. TestPrefix also opts to run tests in parallel (to follow suit with Test).
  • build.yml has now four main jobs:
    • normal: does a full testsuite run, builds the doc, attempts to make install, runs other-checks, check for changes in the manual.
    • others:
      • macos: Compile OCaml and run the full testsuite on MacOS
      • linux-O0: Compile OCaml with CFLAGS='-O0' and run a selection of test directories.
    • extra:
      • debug: Does a full run of the testsuite in with the debug runtime.
      • debug-s4096: Runs a selection of test directories with the debug runtime and a minor heap of 4096 words.
    • build: t the normal, debug, debug-s4096 "jobs" all share the same compiler build.
      This build is compiled during the initial build job. The freshly built directory is then uploaded as a build artifact to be reused by the aforementioned steps.

To be noted: the macos and linux-O0 sub-jobs cannot reuse the compiler built during the build job, because they either run on a different OS, or rely specific on different configure parameters when building the OCaml distribution.

Bring back the features lost after the merge

The normal job is a mirror copy of what the GHA runs do on the 4.14 branch.
One major difference from previous Multicore runs: as for 4.X, the "main" testsuite run is now a full flambda run.
The same applies for every subsequent jobs reusing the compiler artifact built during the build job.
One omission from this, is that I did not reimport the i386-static run in the workflow: is it something desireable?

Discussions

The current draft is an attempt at merging what used to be done in trunk before the merge and the various tweaks introduced by the Multicore team.
The both debug runs proved to be especially useful to the team when developping the Multicore runtime.
We tried to strike a nice balance in running time by only running a few select directories (that historically proved useful for us to insist on.) when adequate.

I think the removal of the super job (which aimed to re-run select directories three times without further adjustements) is fine and spare us from burning more CPU cycles.

Two runs I think are worth discussing about:

  • taskset-c0 (which was left out of this PR) was used to catch some synchronization issues within the Multicore runtime. It does not involve a full build of the compiler (as it can reuse the cached build of the normal testsuite run.), but proves to be problematic in some instances. parallel/pingpong.ml is for instance very very slow in this run.
    (which, looking at the testcase makes sense to me). Do we want to bring it back?
  • linux-O0 (which was left in this PR) caught a few times some issues in the Multicore runtime, catching some problems when optimizations were removed. I could not trace back such scenarios. It involves a full, separate compiler build, as well as running some parts of the testsuite (a full run would be prohibitively slow.)

One more thing worth pondering upon, the Github Actions runner does not use OCAML_TEST_SIZE to provide hints to the runners on the number of cores available: Should this be added as well?

@dra27
Copy link
Member

dra27 commented Apr 24, 2022

Thanks for grasping this nettle, @Engil! Extra commits pushed to:

  • Fix the "Check for manual changes" step
  • Restore the test for building the manual (this is temporarily using my fork of lambdasoup, but I think that the benefit of having the manual CI running while there are quite so many manual chapters being added/revised is probably worth it)
  • Tweaks make check_all_arches and the extraction commands to ensure the other-checks step passes

A review of the other parts to follow 🙂

@dra27
Copy link
Member

dra27 commented May 10, 2022

@abbysmal
Copy link
Contributor Author

abbysmal commented Jun 7, 2022

@dra27 thank you for your additions, I will review the changes soon. :)
Is there anything we are missing to push this PR through?

@dra27
Copy link
Member

dra27 commented Jul 22, 2022

As the four most recent commits testify, we really ought to get this merged 🙈

@Engil - the notes I'd had were:

  • i386-static job: there's either Test 32-bit build in GitHub Actions #11143 or the work you've been doing for arm32 GitHub Actions testing
  • I agree with the removal of tasket-c0. I think we should consider removing linux-O0 from the GHA pipeline - but this might usefully be run in other-configs on Jenkins? My rationale for suggesting that is that these are useful data-points… we have checked large numbers of these logs in the past to spot the intermittent failures and then used that knowledge to guide fixes, so it’s not totally wasted energy use. However, it’s not necessarily useful feedback on a PR.
  • We should utilise OCAML_TEST_SIZE, yes.

However, this PR is most definitely a step-wise improvement over the current pipelines so I suggest that if you're happy with the commits I've pushed and as I'm happy with your part of the PR that we go ahead and merge and iterate in another PR?

@abbysmal
Copy link
Contributor Author

Your latest additions LGTM, I think this is a long overdue improvement and we need to build upon this.

@dra27
Copy link
Member

dra27 commented Jul 22, 2022

Ta - I'll merge it when CI returns (🤞) and cherry-pick to 5.0

@dra27 dra27 merged commit 3ad1567 into ocaml:trunk Jul 22, 2022
dra27 pushed a commit to dra27/ocaml that referenced this pull request Jul 22, 2022
Refactor GitHub actions

(cherry picked from commit 3ad1567)
@Octachron
Copy link
Member

Am I reading correctly that the MacOs job is running without flambda, and thus we have at least one job without flambda enabled?

@dra27
Copy link
Member

dra27 commented Jul 22, 2022

Yes, that's correct - the Windows build is also done without flambda (as on 4.14)

@Octachron
Copy link
Member

Thanks for the confirmation!

@xavierleroy
Copy link
Contributor

I'd rather have most CI jobs without flambda and one job with flambda, if only because non-flambda builds take less time.

We should utilise OCAML_TEST_SIZE, yes.

If OCAML_TEST_SIZE is not set, the test suite tries to scale down to 2-core virtual machines, which is what GHA offers. So, just leaving OCAML_TEST_SIZE unset should work fine.

@xavierleroy
Copy link
Contributor

I think we should consider removing linux-O0 from the GHA pipeline - but this might usefully be run in other-configs on Jenkins?

It should be easy to add a -O0 case to the other-configs Jenkins job. I'm not completely convinced it's going to find bugs that other builds didn't spot, but I can go along.

@dra27
Copy link
Member

dra27 commented Jul 22, 2022

Yes, we have lost a little bit of time because the debug-runtime testsuite is now being run with an flambda compiler. We could remove the -O0 build and do a non-flambda + debug runtime? The build time of this PR isn't a typical comparison, because it touches the manual (PRs which don't touch the manual save ~10 minutes by skipping the build and test of it).

@ctk21 has been a fan of the -O0 testsuite, although I expect that over time we should expect diminishing returns from running it 🙂

@xavierleroy
Copy link
Contributor

Speaking of the manual build: I'm now getting failures with the "normal" job when it tries to build the manual, see e.g.
https://github.com/ocaml/ocaml/runs/7495836889?check_suite_focus=true#step:8:10 . Can you please fix this? or disable the building of the manual?

@dra27
Copy link
Member

dra27 commented Jul 25, 2022

@xavierleroy - I'm 99% certain that what happened here was that the PR was merged before the full pipeline had finished, as the timestamp on the job start for normal and both the testsuite runs is after the PR's merging. That meant that when the "normal" job tried to fetch the additional commits needed to analyse the repo, the git fetch failed because the PR's "magic ref" ceases to exist after it's been merged.

I'll open a PR to move that analysis to the "build" job and then have it communicate with output variables, but I don't think it's that testing the manual's build is itself fundamentally broken.

@xavierleroy
Copy link
Contributor

I'm 99% certain that what happened here was that the PR was merged before the full pipeline had finished, as the timestamp on the job start for normal and both the testsuite runs is after the PR's merging. That meant that when the "normal" job tried to fetch the additional commits needed to analyse the repo, the git fetch failed because the PR's "magic ref" ceases to exist after it's been merged.

Would it be possible to just skip the test in this case, rather than marking it as a failure? It's not that something wrong was found in the manual, just that the test could not be run because of circumstances.

Whatever solution you choose, I would appreciate no longer receiving "PR run failed: Build" e-mails from Github, which I have to investigate every time, just to see that the problem is with the test of the manual. Thank you for your understanding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants