Convert Mirror Repo strategy to self-hosted github Runners #264

Open
Gregory-Pereira opened this issue Apr 14, 2024 · 12 comments

Comments

@Gregory-Pereira
Collaborator

The current mirror-repo strategy for driving builds is not scalable. We should look at moving to self-hosted GitHub runners, where we can mount the models, stored on persistent storage, into the filesystem in such a way that our tests will not run out of storage and will not be flaky due to multi-gigabyte model downloads. Even if we could limp along with our current solution, switching to this strategy will be a requirement for testing our multi-model feature in the llamacpp_python model_server.
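
A rough sketch of what a test job on such a runner could look like (the runner labels, mount path, model filename, and make target below are placeholders, not decisions):

```yaml
# Hypothetical job on a self-hosted runner with models pre-mounted from
# persistent storage. Labels, paths, and the model filename are placeholders.
name: model-server-tests
on: [pull_request]

jobs:
  test-llamacpp:
    runs-on: [self-hosted, linux, x64]
    env:
      # Pre-mounted model directory; no multi-gigabyte download in the job.
      MODEL_PATH: /mnt/models/example-model.Q4_K_M.gguf
    steps:
      - uses: actions/checkout@v4
      - name: Run model_server tests against the mounted model
        run: make test MODEL_PATH="$MODEL_PATH"   # illustrative test entrypoint
```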

The initial idea was discussed in the Slack thread beginning at: https://redhat-internal.slack.com/archives/C06S75ZF9JT/p1713089733094399?thread_ts=1712828397.645709&cid=C06S75ZF9JT.

We plan to implement this after Release 1.0 so as not to interfere with the release, but the POC can be developed and run alongside our workloads leading up to and during the release.

/assign @lmilbaum
/assign @Gregory-Pereira

@Gregory-Pereira
Collaborator Author

Gregory-Pereira commented Apr 14, 2024

Rather than maintaining individual runners manually, we are looking into dynamically provisioning EC2 spot instances for the runners. Two libraries were identified that we could leverage in our implementation:

  1. https://github.com/philips-labs/terraform-aws-github-runner
  2. https://github.com/machulav/ec2-github-runner

After initial discussion, we decided to proceed with a POC using the Terraform-based implementation, due to that project's contribution velocity and resources, as well as some early discussion around support for Darwin builds: philips-labs/terraform-aws-github-runner#2069 (comment). (A sketch of what the ec2-github-runner alternative would look like is included after the step list below for comparison.)

The following steps were determined in order to complete this feature request:

  1. Start by testing just amd64 and arm64 builds on Linux using the Terraform-based repo.
  2. @Gregory-Pereira to determine whether his spare Mac mini can run as a dedicated runner to cover OSX amd64 builds in the interim.
    • Look for an interim solution for OSX arm64 as well, since it is more important than OSX amd64.
  3. Contribute the upstream changes to the Terraform-based repo to enable Darwin builds.
  4. Update our CI to use these new Darwin builds once they merge upstream.
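
For comparison, the ec2-github-runner approach (library 2 above) drives provisioning from the workflow itself rather than from Terraform. Below is a minimal sketch following that action's documented start/build/stop pattern; the AMI, subnet, security-group IDs, and secret names are all placeholders:

```yaml
# Sketch of the machulav/ec2-github-runner start/build/stop pattern, shown for
# comparison with the Terraform-based approach. All IDs and secrets are placeholders.
# Note: this sketch launches an on-demand instance; spot configuration is not shown.
name: ec2-runner-example
on: [workflow_dispatch]

jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_RUNNER_PAT }}      # PAT with repo scope (placeholder name)
          ec2-image-id: ami-0123456789abcdef0             # placeholder AMI
          ec2-instance-type: m5.xlarge                    # placeholder instance type
          subnet-id: subnet-0123456789abcdef0             # placeholder
          security-group-id: sg-0123456789abcdef0         # placeholder

  build:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}
    steps:
      - uses: actions/checkout@v4
      - run: make build                                   # illustrative build step

  stop-runner:
    needs: [start-runner, build]
    runs-on: ubuntu-latest
    if: ${{ always() }}                                   # tear down even if the build fails
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_RUNNER_PAT }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```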

@lmilbaum
Collaborator

lmilbaum commented May 5, 2024

subscription-manager should be available on the self-hosted runner to unlock installing RHEL packages when building RHEL-based images.
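
For example (a sketch only; the step wording and secret names below are placeholders), a workflow building a RHEL-based image could then register the runner host before installing packages:

```yaml
# Hypothetical workflow step: register the self-hosted runner host with
# subscription-manager before building RHEL-based images.
# RHSM_ORG_ID and RHSM_ACTIVATION_KEY are placeholder secret names.
- name: Register host with subscription-manager
  run: |
    sudo subscription-manager register \
      --org "${{ secrets.RHSM_ORG_ID }}" \
      --activationkey "${{ secrets.RHSM_ACTIVATION_KEY }}"
```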

@lmilbaum
Collaborator

lmilbaum commented May 6, 2024

We would like to align those efforts with the instructlab and osbuild github.com organizations.

@lmilbaum
Collaborator

lmilbaum commented May 6, 2024

Related to containers/podman-desktop#7066

@cverna

cverna commented May 6, 2024

If you are interested in using Fedora CoreOS for the self-hosted runners, I wrote this article a couple of years ago: https://fedoramagazine.org/run-github-actions-on-fedora-coreos/. Using FCOS makes it really easy to spin instances up or down.
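
For reference, one way to wire that up (a sketch only, not necessarily the article's exact setup; the container image and env-file path are placeholders) is a Butane config with a systemd unit that runs a containerized runner:

```yaml
# Minimal Butane sketch for an FCOS host running a containerized Actions runner.
# The runner image and env file are placeholders; see the linked article for a
# complete, working configuration.
variant: fcos
version: 1.5.0
systemd:
  units:
    - name: github-runner.service
      enabled: true
      contents: |
        [Unit]
        Description=Containerized GitHub Actions runner
        After=network-online.target
        Wants=network-online.target
        [Service]
        ExecStartPre=-/usr/bin/podman rm -f github-runner
        ExecStart=/usr/bin/podman run --rm --name github-runner \
          --env-file /etc/github-runner.env \
          quay.io/example/github-actions-runner:latest
        ExecStop=/usr/bin/podman stop github-runner
        Restart=always
        [Install]
        WantedBy=multi-user.target
```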

@cevich
Member

cevich commented May 6, 2024

I'd like to set and temper expectations around the proposed dynamic/ephemeral runner solution:

  1. GitHub does not recommend this setup for public repositories; it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.
  2. Dynamic + ephemeral runners require a bot/App with admin access. This is probably "okay" for a single repo, but it's going to be a very "hard sell" to force it upon the whole org. A compromise would provide unlimited/unrestricted access to everything!
  3. A dedicated cloud project should be used for this. Given point 1, were some process to escape and gain access to cloud resources, the potential damage radius should be kept as small as possible.
  4. Please consider how the overall setup is to be monitored long-term. Passing/failing jobs alone likely isn't good enough; there should be some telemetry from the systems that verifies their "short" lifespan.
  5. Complex systems really benefit from good, up-to-date documentation. Other maintainers will likely become involved, and without documentation they will become incredibly frustrated figuring everything out from scratch.

I do not want to be completely negative on this effort, so here are some (hopefully) constructive suggestions:

  • Could statically provisioned, manually registered runners be "good enough"? No bots need admin access, and that setup is very simple to maintain, document, and monitor.
  • Can this repo be moved to a less-populated org, where org-wide admin access is a smaller risk?
  • Perhaps there is a simpler CI system that could work? Even better if it's maintained and monitored by a dedicated team and/or doesn't require "runners".
  • Could GHA be used differently, for example to directly provision its own cloud resources and orchestrate workloads itself?

@ckyrouac
Contributor

GitHub does not recommend this setup for public repositories; it's not safe. Assuming they know their own system better than anybody, I tend to trust their advice.

FWIW, I've been looking into adding ephemeral self-hosted runners to the podman-bootc repo. It seems this is more of an upfront warning to say "do not use public self-hosted runners unless you know what you are doing." AFAICT, a combination of isolated/ephemeral runners and requiring approvals before running workflows from unknown contributors will significantly mitigate the risk.

https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks

Some interesting discussion here too: https://github.com/orgs/community/discussions/26722#discussioncomment-3253085

@lmilbaum
Collaborator

lmilbaum commented May 21, 2024

@cevich
Member

cevich commented May 21, 2024

will significantly mitigate the risk.

Agreed, it probably does. I just want to go into this effort mindful that there are likely significant, impactful, and non-obvious "gotchas" and pitfalls, security and reliability issues included. GitHub is closed-source; they have no incentive to disclose all the reasoning behind their recommendations.

Thanks for the discussion link, I'll be sure to read through it to educate myself.

@lmilbaum
Collaborator

lmilbaum commented Jun 3, 2024

@Gregory-Pereira @cevich I don't have the cycles to drive this effort. Is that something one of you can drive?

@Gregory-Pereira
Collaborator Author

I believe @cooktheryan will be driving this effort when he gets back, but I will most certainly help him push it forward and/or do the implementation given an agreed-upon plan.

@lmilbaum
Collaborator

lmilbaum commented Jun 5, 2024

See @cgwalters's comment on containers/bootc#496. Yet another reason to prioritize this effort.
