Migrated MR - Fedora CoreOS Stable/Testing/Next Support #8

Open
chris-bateman opened this issue Mar 5, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@chris-bateman

Migrating this as we would like to see native support for Fedora CoreOS in this project.
See - https://gitlab.com/nvidia/container-images/driver/-/issues/34

@fifofonix
Contributor

fifofonix commented Mar 26, 2024

Summarizing the linked issue with what I believe to be the latest status.

There is presently no official support for Fedora or Fedora CoreOS (FCOS), i.e., no docker images are published automatically to the NVIDIA container registry. A community contribution under ci/fedora/.gitlab-ci-fcos.yml provides an alternate GitLab CI/CD file that can be used to build FCOS next/testing/stable driver container images with pre-compiled kernel headers, but it assumes the existence of FCOS gitlab-runners tracking the various streams ^1.

My fork of this project on GitLab has used this strategy successfully for a couple of years now to build images and push them to Docker Hub.

The resulting images are currently used successfully on typhoon k8s FCOS clusters. However, those clusters run nvidia-device-plugin with the container toolkit installed on the host and the driver container running outside of k8s. I had gpu-operator running some time ago, but I have not been able to get it working more recently, for reasons unknown.

^1 Note that it should be possible to build functioning images without pre-compiled kernel headers, and to do so without using FCOS gitlab-runners.

@zwshan

zwshan commented May 28, 2024

Hello, I followed the instructions in your document and executed the commands,

$ DRIVER_VERSION=535.154.05 # Check ci/fedora/.common-ci-fcos.yml for latest
$ FEDORA_VERSION_ID=$(cat /etc/os-release | grep VERSION_ID | cut -d = -f2)
$ podman run -d --privileged --pid=host \
     -v /run/nvidia:/run/nvidia:shared \
     -v /var/log:/var/log \
     --name nvidia-driver \
     registry.gitlab.com/container-toolkit-fcos/driver:${DRIVER_VERSION}-fedora${FEDORA_VERSION_ID}

but the container automatically changes to an 'exited' state, and it remains the same even after restarting. What should I do?
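
For reference, the image tag in the command above is the driver version followed by `-fedora` and the host's `VERSION_ID`. A minimal sketch of deriving that reference by sourcing an os-release file follows. The function name `driver_image_ref` and its optional path argument are hypothetical and introduced here only for illustration; the registry path is the one quoted in this thread.

```shell
# Hypothetical helper (not part of the project): print the image reference
# for a given driver version, reading VERSION_ID from an os-release file.
driver_image_ref() {
  # Run in a subshell so the os-release variables do not leak into the caller.
  (
    driver_version="$1"
    os_release="${2:-/etc/os-release}"   # path override is only for testing
    . "$os_release"
    printf '%s\n' "registry.gitlab.com/container-toolkit-fcos/driver:${driver_version}-fedora${VERSION_ID}"
  )
}
```

Printing the result of `driver_image_ref "$DRIVER_VERSION"` before calling `podman run` is a quick way to confirm the tag is the one you expect.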

@fifofonix
Contributor

I expect that if you examine the podman logs for the container above, you will see compilation errors (they are hard to spot because there are also a lot of expected warnings).
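
One rough way to separate the errors from the warnings is to filter the log text. This is only a sketch: the helper name `filter_build_errors` and the match patterns are assumptions, not something the driver container provides, and the patterns may need adjusting for the actual build output.

```shell
# Hypothetical helper: read log text on stdin and keep lines that mention
# an error, dropping lines that are only compiler warnings.
filter_build_errors() {
  grep -iE 'error' | grep -viE 'warning'
}
```

With the container name used earlier in the thread, usage would be something like `podman logs nvidia-driver 2>&1 | filter_build_errors`.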

These errors occur with Fedora 40 and won't be officially addressed by NVIDIA until they incorporate a patch they have advertised on their community forum. I have this patch applied on my fleet and it works fine, but applying it is tricky.

See: https://gitlab.com/container-toolkit-fcos/driver/-/issues/11

4 participants