Trouble Running NVIDIA GPU Containers on Custom Yocto-Based Distro on HPE Server with NVIDIA A40 GPU #424

Nauman3S opened this issue Mar 23, 2024 · 0 comments

I'm experiencing difficulties running NVIDIA GPU containers on a custom Yocto-based distribution tailored for an HPE server equipped with an NVIDIA A40 GPU. Despite having set up a custom meta-nvidia layer (mickledore branch), which includes recipes for NVIDIA drivers, libnvidia-container, libtirpc, and nvidia-container-toolkit (based on meta-tegra's recipes-containers layer at OE4T/meta-tegra), I encounter errors when attempting to run containers that utilize the GPU.

Distro Details:

Distro: poky
Included recipes and layers: containerd, virtualization layers, NVIDIA drivers and kernel modules, systemd, kernel headers, etc.

Issue Reproduction Steps:

Configuring the container runtime:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
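
For reference, the runtime entry that nvidia-ctk writes into /etc/containerd/config.toml looks roughly like the following (paths shown are the stock toolkit defaults, so the generated file on this image may differ):

# section added by: nvidia-ctk runtime configure --runtime=containerd
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"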

Pulling images for testing:

sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubi8
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubi8
Running a container with GPU:

sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04 test nvidia-smi
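
For comparison, an invocation that routes through nvidia-container-runtime directly rather than the --gpus hook would look roughly like the following (the --runc-binary path assumes the default toolkit install location and may differ on this image):

sudo ctr run --rm --runc-binary=/usr/bin/nvidia-container-runtime \
    --env NVIDIA_VISIBLE_DEVICES=all \
    docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04 test nvidia-smi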

Error Message:

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown
This error persists across all pulled NVIDIA images (non-Ubuntu-based images show the same error but with /sbin/ldconfig instead of /sbin/ldconfig.real). However, non-GPU containers (e.g., docker.io/macabees/neofetch:latest) work without issues.
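
To get more detail than the single stderr line above, the hook's library discovery can also be exercised directly with the CLI's debug log enabled (the log path below is arbitrary):

# run the same discovery the prestart hook performs, with a debug log
sudo nvidia-container-cli --debug=/tmp/nvidia-container-cli.log --load-kmods list
# check the host ldcache the hook relies on
sudo /sbin/ldconfig -p | grep -i nvidia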

Further Details:

Running ldconfig -p shows 264 libs found, including various NVIDIA libraries, and running ldconfig itself produces no errors.

uname -a output: Linux intel-corei7-64-02 6.1.38-intel-pk-standard #1 SMP PREEMPT_DYNAMIC Thu Jul 13 04:53:52 UTC 2023 x86_64 GNU/Linux

Output from sudo nvidia-container-cli -k -d /dev/tty info includes warnings about missing libraries and compat32 libraries, although nvidia-smi shows the GPU is recognized correctly.

Attempted Solutions:

Verifying that all NVIDIA driver and toolkit components are correctly installed. Ensuring that the ldconfig cache is current and includes the NVIDIA library paths, and that /sbin/ldconfig.real is a symlink to /sbin/ldconfig.
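
For reference, the ldconfig path the hook invokes is taken from the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml. The default value differs between toolkit builds (e.g. @/sbin/ldconfig vs @/sbin/ldconfig.real), so the entry shown below is only an example and the generated file on this image may differ:

[nvidia-container-cli]
# a leading "@" makes libnvidia-container execute ldconfig from the host
# rather than from inside the container image
ldconfig = "@/sbin/ldconfig.real"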

Despite these efforts, the error persists, and GPU containers fail to start. I'm seeking advice on resolving this ldcache and container initialization error to run NVIDIA GPU containers on this custom Yocto-based distribution.
