Convert Mirror Repo strategy to self-hosted github Runners #264

Open
Gregory-Pereira opened this issue Apr 14, 2024 · 12 comments

Comments

@Gregory-Pereira
Collaborator

The current mirror-repo strategy for driving builds is not scalable. We should look at moving to self-hosted GitHub runners, where we can mount the models, stored on persistent storage, into the filesystem in such a way that our tests will not run out of storage and will not be flaky due to multi-gigabyte model downloads. Even if we could limp along with our current solution, switching to this strategy will be a requirement for testing our multi-model feature in the llamacpp_python model_server.
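
A rough sketch of what a test job on such a runner could look like (the runner labels, mount path, model filename, and make target below are placeholders, not decisions):

```yaml
# Hypothetical job on a self-hosted runner with models pre-mounted from
# persistent storage. Labels, paths, and the model filename are placeholders.
name: model-server-tests
on: [pull_request]

jobs:
  test-llamacpp:
    runs-on: [self-hosted, linux, x64]
    env:
      # Pre-mounted model directory; no multi-gigabyte download in the job.
      MODEL_PATH: /mnt/models/example-model.Q4_K_M.gguf
    steps:
      - uses: actions/checkout@v4
      - name: Run model_server tests against the mounted model
        run: make test MODEL_PATH="$MODEL_PATH"   # illustrative test entrypoint
```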

The initial idea was discussed in the Slack thread beginning at: https://redhat-internal.slack.com/archives/C06S75ZF9JT/p1713089733094399?thread_ts=1712828397.645709&cid=C06S75ZF9JT.

We plan to implement this after Release 1.0 so as not to interfere with the release, but the POC can be developed and run alongside our workloads leading up to and during the release.

/assign @lmilbaum
/assign @Gregory-Pereira

@Gregory-Pereira
Collaborator Author

Gregory-Pereira commented Apr 14, 2024

Rather than maintaining individual runners manually, we are looking into dynamically provisioning EC2 spot instances for the runners. Two libraries were identified that we could leverage in our implementation:

  1. https://github.com/philips-labs/terraform-aws-github-runner
  2. https://github.com/machulav/ec2-github-runner

After initial discussion, we decided to proceed with a POC using the Terraform-based implementation, due to that project's contribution velocity and resources, as well as some early discussion around support for Darwin builds: philips-labs/terraform-aws-github-runner#2069 (comment). (A sketch of what the ec2-github-runner alternative would look like is included after the step list below for comparison.)

The following steps were determined in order to complete this feature request:

  1. Start by testing just amd64 and arm64 builds on Linux using the Terraform-based repo.
  2. @Gregory-Pereira to determine whether his spare Mac mini can run as a dedicated runner to cover OSX amd64 builds in the interim.
    • Look for an interim solution for OSX arm64 as well, since it is more important than OSX amd64.
  3. Contribute the upstream changes to the Terraform-based repo to enable Darwin builds.
  4. Update our CI to use these new Darwin builds once they merge upstream.
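
For comparison, the ec2-github-runner approach (library 2 above) drives provisioning from the workflow itself rather than from Terraform. Below is a minimal sketch following that action's documented start/build/stop pattern; the AMI, subnet, security-group IDs, and secret names are all placeholders:

```yaml
# Sketch of the machulav/ec2-github-runner start/build/stop pattern, shown for
# comparison with the Terraform-based approach. All IDs and secrets are placeholders.
# Note: this sketch launches an on-demand instance; spot configuration is not shown.
name: ec2-runner-example
on: [workflow_dispatch]

jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_RUNNER_PAT }}      # PAT with repo scope (placeholder name)
          ec2-image-id: ami-0123456789abcdef0             # placeholder AMI
          ec2-instance-type: m5.xlarge                    # placeholder instance type
          subnet-id: subnet-0123456789abcdef0             # placeholder
          security-group-id: sg-0123456789abcdef0         # placeholder

  build:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}
    steps:
      - uses: actions/checkout@v4
      - run: make build                                   # illustrative build step

  stop-runner:
    needs: [start-runner, build]
    runs-on: ubuntu-latest
    if: ${{ always() }}                                   # tear down even if the build fails
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_RUNNER_PAT }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```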

@lmilbaum
Collaborator

lmilbaum commented May 5, 2024

subscription-manager should be available on the self-hosted runner to unlock installing RHEL packages when building RHEL-based images.
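
For example (a sketch only; the step wording and secret names below are placeholders), a workflow building a RHEL-based image could then register the runner host before installing packages:

```yaml
# Hypothetical workflow step: register the self-hosted runner host with
# subscription-manager before building RHEL-based images.
# RHSM_ORG_ID and RHSM_ACTIVATION_KEY are placeholder secret names.
- name: Register host with subscription-manager
  run: |
    sudo subscription-manager register \
      --org "${{ secrets.RHSM_ORG_ID }}" \
      --activationkey "${{ secrets.RHSM_ACTIVATION_KEY }}"
```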

@lmilbaum
Collaborator

lmilbaum commented May 6, 2024

We would like to align those efforts with the instructlab and osbuild github.com organizations.

@lmilbaum
Collaborator

lmilbaum commented May 6, 2024

Related to containers/podman-desktop#7066

@cverna

cverna commented May 6, 2024

If you are interested in using Fedora CoreOS for the self-hosted runners, I wrote this article a couple of years ago: https://fedoramagazine.org/run-github-actions-on-fedora-coreos/. Using FCOS makes it really easy to spin instances up or down.
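
For reference, one way to wire that up (a sketch only, not necessarily the article's exact setup; the container image and env-file path are placeholders) is a Butane config with a systemd unit that runs a containerized runner:

```yaml
# Minimal Butane sketch for an FCOS host running a containerized Actions runner.
# The runner image and env file are placeholders; see the linked article for a
# complete, working configuration.
variant: fcos
version: 1.5.0
systemd:
  units:
    - name: github-runner.service
      enabled: true
      contents: |
        [Unit]
        Description=Containerized GitHub Actions runner
        After=network-online.target
        Wants=network-online.target
        [Service]
        ExecStartPre=-/usr/bin/podman rm -f github-runner
        ExecStart=/usr/bin/podman run --rm --name github-runner \
          --env-file /etc/github-runner.env \
          quay.io/example/github-actions-runner:latest
        ExecStop=/usr/bin/podman stop github-runner
        Restart=always
        [Install]
        WantedBy=multi-user.target
```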

@cevich
Member

cevich commented May 6, 2024

I'd like to set and temper expectations around the proposed dynamic/ephemeral runner solution:

  1. GitHub does not recommend this setup for public repositories; it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.
  2. Dynamic + ephemeral runners require a bot/App with admin access. This is probably "okay" for a single repo, but it's going to be a very "hard sell" to force it upon the whole org. A compromise would provide unlimited/unrestricted access to everything!
  3. A dedicated cloud project should be used for this. Given point 1, were some process to escape and gain access to cloud resources, the potential damage radius should be kept as small as possible.
  4. Please consider how the overall setup is to be monitored long-term. Passing/failing jobs alone likely isn't good enough; there should be some telemetry from the systems that verifies their "short" lifespan.
  5. Complex systems really benefit from good, up-to-date documentation. Other maintainers will likely become involved, and without documentation they will become incredibly frustrated figuring everything out from scratch.

I do not want to be completely negative on this effort, so here are some (hopefully) constructive suggestions:

  • Could statically provisioned, manually registered runners be "good enough"? No bots need admin access, and that setup is very simple to maintain, document, and monitor.
  • Can this repo be moved to a less-populated org, where org-wide admin access is a smaller risk?
  • Perhaps there is a simpler CI system that could work? Even better if it's maintained and monitored by a dedicated team and/or doesn't require "runners".
  • Could GHA be used differently, for example to directly provision its own cloud resources and orchestrate workloads itself?

@ckyrouac
Contributor

GitHub does not recommend this setup for public repositories; it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.

FWIW, I've been looking into adding ephemeral self-hosted runners to the podman-bootc repo. It seems this is more of an upfront warning to say "do not use public self-hosted runners unless you know what you are doing." AFAICT, a combination of isolated/ephemeral runners and requiring approvals before running workflows from unknown contributors will significantly mitigate the risk.

https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks

Some interesting discussion here too: https://github.com/orgs/community/discussions/26722#discussioncomment-3253085

@lmilbaum
Collaborator

lmilbaum commented May 21, 2024

@cevich
Member

cevich commented May 21, 2024

will significantly mitigate the risk.

Agreed, it probably does. I just want to go into this effort mindful that there are likely significant, impactful, and non-obvious "gotchas" and pitfalls, security and reliability issues included. GitHub is closed-source; they have no incentive to disclose all the reasoning behind their recommendations.

Thanks for the discussion link, I'll be sure to read through it to educate myself.

@lmilbaum
Collaborator

lmilbaum commented Jun 3, 2024

@Gregory-Pereira @cevich I don't have the cycles to drive this effort. Is that something one of you can drive?

@Gregory-Pereira
Collaborator Author

I believe @cooktheryan will be driving this effort when he gets back, but I will most certainly help him push it forward and/or do the implementation given an agreed-upon plan.

@lmilbaum
Collaborator

lmilbaum commented Jun 5, 2024

See @cgwalters's comment on containers/bootc#496. Yet another reason to prioritize this effort.
