Nvidia-container-toolkit #378

Open
MorphSeur opened this issue Feb 27, 2024 · 8 comments

@MorphSeur

MorphSeur commented Feb 27, 2024

Hello!

After carefully following the NVIDIA Container Toolkit installation guide, Docker containers are still unable to use the nvidia runtime.

  • Here is the output of grep nvidia /etc/apt/sources.list.d/*.list.
/etc/apt/sources.list.d/cuda-ubuntu2204-x86_64.list:deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /
/etc/apt/sources.list.d/nvidia-container-toolkit.list:deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
/etc/apt/sources.list.d/nvidia-container-toolkit.list:deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
  • Then, after configuring Docker, here is some output related to the Docker daemon:
    • Running the Docker image with the nvidia runtime:
ubuntu@ubuntu:~$ sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 
ubuntu@ubuntu:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
ubuntu@ubuntu:~$ sudo systemctl restart docker
ubuntu@ubuntu:~$ docker run --runtime=nvidia --gpus 1 --rm -v ./models/:/models -v ./audios:/audios -v ./outputs:/outputs cuda-app:latest
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
    • Running the Docker image without the nvidia runtime:
ubuntu@ubuntu:~$ docker run --gpus 1 --gpus 1 --rm -v ./models/:/models -v ./audios:/audios -v ./outputs:/outputs cuda-app:latest
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
  • Here are some additional details:
ubuntu@ubuntu:~$ docker images
REPOSITORY    TAG                       IMAGE ID       CREATED         SIZE
ubuntu        latest                    3db8720ecbf5   2 weeks ago     77.9MB
cuda-app      latest                    0978724b7806   3 weeks ago     2.75GB
nvidia/cuda   12.3.1-base-ubuntu20.04   d13839a3c4fb   2 months ago    246MB
hello-world   latest                    d2c94e258dcb   10 months ago   13.3kB
ubuntu@ubuntu:~$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
ubuntu@ubuntu:~$ docker --version
Docker version 24.0.7, build afdd53b
ubuntu@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
ubuntu@ubuntu:~$ nvidia-smi
Tue Feb 27 15:23:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:04:00.0 Off |                  N/A |
| 34%   25C    P8              12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:07:00.0 Off |                  N/A |
|  0%   45C    P8              26W / 350W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Thanks a lot for your help with this issue!

@elezar
Member

elezar commented Feb 27, 2024

Hi. How was Docker installed? Is this Docker Desktop or Docker Engine?

@MorphSeur
Author

MorphSeur commented Feb 27, 2024

Thanks for your reply!

It is a Docker Engine installation.

  • Additional information:
    • The Docker images had been working with nvidia-container-toolkit for about a month, until this morning.
    • Someone updated the NVIDIA driver from 12.1 to 12.2.

@elezar
Member

elezar commented Mar 5, 2024

I would say that the primary issue is that you're not able to configure the nvidia runtime for your docker installation. It could be that the config file is not being used and that arguments to the docker daemon are being used instead. Could you confirm whether this is the case?
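
One way to check this is to look at the command line the daemon was started with. The sketch below assumes the daemon honours the `--config-file` flag of `dockerd` and otherwise falls back to `/etc/docker/daemon.json`. The helper name `dockerd_config_file` is hypothetical and not part of any NVIDIA or Docker tooling.

```shell
# Hypothetical helper: given a dockerd command line, print the config file
# it would read. Falls back to the default /etc/docker/daemon.json when no
# --config-file flag is present.
dockerd_config_file() {
    cmdline="$1"
    prev=""
    for word in $cmdline; do
        case "$word" in
            --config-file=*)
                # Form: --config-file=/path/to/daemon.json
                printf '%s\n' "${word#--config-file=}"
                return 0
                ;;
        esac
        if [ "$prev" = "--config-file" ]; then
            # Form: --config-file /path/to/daemon.json
            printf '%s\n' "$word"
            return 0
        fi
        prev="$word"
    done
    printf '%s\n' "/etc/docker/daemon.json"
}

# Example on a live host (not run here):
#   dockerd_config_file "$(ps -o args= -C dockerd)"
#   docker info --format '{{json .Runtimes}}'   # runtimes the daemon actually registered
```

If the printed path is not `/etc/docker/daemon.json`, or `nvidia` is missing from the registered runtimes, the file written by `nvidia-ctk` is probably not the one the daemon reads.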

@MorphSeur
Author

Thanks a lot for your reply.

The nvidia runtime is configured as described in the documentation.
Which config file do you mean? Is it /etc/nvidia-container-runtime/config.toml?

@elezar
Member

elezar commented Mar 5, 2024

The documentation is valid if the Docker daemon is using the /etc/docker/daemon.json config file. If the daemon is configured through another mechanism or uses a different config file, the instructions need to be adapted. How is your docker daemon configured?

@MorphSeur
Author

Thanks for your reply.

The daemon is configured using: sudo nvidia-ctk runtime configure --runtime=docker
Here is the file:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

@elezar
Member

elezar commented Mar 6, 2024

@MorphSeur if this config was being applied, then the following error would not be triggered:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.

This is triggered repeatedly for all your examples indicating that the runtime is not being configured correctly. To address this we would have to understand what is non-standard about your docker installation. Do the docker daemon logs (journalctl -xu docker.service) show any messages related to the config or the runtimes when the daemon is (re)started?

Are you perhaps running a rootless docker so that the instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#rootless-mode may need to be followed?
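
A quick way to answer the rootless question is to inspect the security options reported by `docker info`. This is a minimal sketch, assuming a rootless daemon lists `name=rootless` among those options. The helper name `is_rootless` is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if the given security-options string,
# as printed by `docker info`, indicates a rootless daemon.
is_rootless() {
    case "$1" in
        *name=rootless*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example on a live host (not run here):
#   if is_rootless "$(docker info --format '{{.SecurityOptions}}')"; then
#       echo "rootless daemon: follow the rootless-mode instructions"
#   else
#       echo "rootful daemon"
#   fi
```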

@MorphSeur
Author

Yes, the Docker daemon logs contain errors related to the runtime; they are similar to the error above.

Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.380223199+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.380234811+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.487476417+01:00" level=error msg="Handler for POST /v1.43/containers/87f622a9e91d4ff977b7684e279badd8345e63ed91ffa10923fa34080754d593/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.315445730+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.315497738+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.419085301+01:00" level=error msg="Handler for POST /v1.43/containers/0f60f75fd2c4b678fa25aa43b0323a4822d295acade31e2c29ee4670e66d58cb/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.244479007+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.244519553+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.434034408+01:00" level=error msg="Handler for POST /v1.43/containers/0b3069f08e0eb41d9c0f5967fa113c07fa5d27d72792f3b4a4d05e57a8225851/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 12:01:05 ubuntu systemd[1]: Stopping Docker Application Container Engine...

As for Docker, it is running as root, not in rootless mode.
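
The log lines above report `driver/library version mismatch`. A common cause, though not a confirmed diagnosis for this machine, is that the loaded kernel module and the user-space NVIDIA libraries come from different driver versions, for example after a driver update without a reboot. The sketch below compares the two version strings. The helper names are hypothetical, and the library path in the usage comment is an assumption for Ubuntu on x86_64.

```shell
# Extract the first dotted version number (e.g. 535.161.07) from a string.
extract_version() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Succeed only if both strings contain a version and the versions are equal.
versions_match() {
    kernel_ver=$(extract_version "$1")
    lib_ver=$(extract_version "$2")
    [ -n "$kernel_ver" ] && [ "$kernel_ver" = "$lib_ver" ]
}

# Example on a live host (not run here; library path is an assumption):
#   kernel_line=$(head -n 1 /proc/driver/nvidia/version)
#   lib_path=$(readlink -f /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1)
#   versions_match "$kernel_line" "$lib_path" \
#       || echo "kernel module and libnvidia-ml differ: reboot or reload the nvidia modules"
```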
