
WSL NixOS cdi generate Error: failed to initialize dxcore context #452

Open
Samiser opened this issue Apr 10, 2024 · 3 comments

Samiser commented Apr 10, 2024

❯ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.15.0-rc.3

I'm attempting to use nvidia-ctk to generate a CDI spec in WSL running NixOS, but I'm getting the following error:

❯ nvidia-ctk cdi generate --nvidia-ctk-path /run/current-system/sw/bin/nvidia-ctk --ldconfig-path /run/current-system/sw/bin/ldconfig --mode wsl
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

If I generate the CDI spec on a different VM and use that config directly (only changing the location of nvidia-ctk), then nvidia-ctk successfully finds the device and I can use it in containers:

nvidia-container-toolkit.json:
{
    "cdiVersion": "0.3.0",
    "containerEdits": {
        "hooks": [
            {
                "args": [
                    "nvidia-ctk",
                    "hook",
                    "create-symlinks",
                    "--link",
                    "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi::/usr/bin/nvidia-smi"
                ],
                "hookName": "createContainer",
                "path": "/run/current-system/sw/bin/nvidia-ctk"
            },
            {
                "args": [
                    "nvidia-ctk",
                    "hook",
                    "update-ldcache",
                    "--folder",
                    "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69",
                    "--folder",
                    "/usr/lib/wsl/lib"
                ],
                "hookName": "createContainer",
                "path": "/run/current-system/sw/bin/nvidia-ctk"
            }
        ],
        "mounts": [
            {
                "containerPath": "/usr/lib/wsl/lib/libdxcore.so",
                "hostPath": "/usr/lib/wsl/lib/libdxcore.so",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda.so.1.1",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda.so.1.1",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda_loader.so",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libcuda_loader.so",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml.so.1",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml.so.1",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml_loader.so",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ml_loader.so",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ptxjitcompiler.so.1",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/libnvidia-ptxjitcompiler.so.1",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvcubins.bin",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvcubins.bin",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            },
            {
                "containerPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi",
                "hostPath": "/usr/lib/wsl/drivers/nv_dispig.inf_amd64_1fea8972dc2f0a69/nvidia-smi",
                "options": [
                    "ro",
                    "nosuid",
                    "nodev",
                    "bind"
                ]
            }
        ]
    },
    "devices": [
        {
            "containerEdits": {
                "deviceNodes": [
                    {
                        "path": "/dev/dxg"
                    }
                ]
            },
            "name": "all"
        }
    ],
    "kind": "nvidia.com/gpu"
}

I've also tried populating every other flag with the locations of the files in /usr/lib/wsl/, but that didn't make a difference; I assume that's handled by --mode wsl.

Here's the relevant Nix config if it helps (omitting the nixos-wsl import section):

{
  wsl.enable = true;

  environment.systemPackages = with pkgs; [ nvidia-container-toolkit ];

  virtualisation.podman.enable = true;
  virtualisation.containers.cdi.dynamic.nvidia.enable = true;

  programs.nix-ld.enable = true;

  environment.variables = lib.mkForce {
    NIX_LD_LIBRARY_PATH = "/usr/lib/wsl/lib/";
    NIX_LD = "${pkgs.glibc}/lib/ld-linux-x86-64.so.2";
  };
}

And here's the GPU working with the manual config:

❯ nvidia-ctk cdi list
INFO[0000] Found 1 CDI devices
nvidia.com/gpu=all

❯ podman run --device nvidia.com/gpu=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -benchmark -gpu
--- cut ---
> Compute 7.5 CUDA device: [NVIDIA GeForce RTX 2070]
36864 bodies, total time for 10 iterations: 65.098 ms
= 208.756 billion interactions per second
= 4175.121 single-precision GFLOP/s at 20 flops per interaction

Let me know if there's any more information I can provide!


elezar commented Apr 12, 2024

@Samiser could you run:

nvidia-ctk --debug cdi generate

I assume that the utility does not find libdxcore.so by itself, meaning that the mode needs to be explicitly set.

Note that we do use dlopen to load libdxcore.so, so you could try setting LD_PRELOAD=${PATH_TO_LIB}/libdxcore.so explicitly. This should help both the autodetection and the generation.
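
For example (a minimal sketch; the path assumes the standard WSL library location, /usr/lib/wsl/lib, as in the config above):

❯ LD_PRELOAD=/usr/lib/wsl/lib/libdxcore.so nvidia-ctk --debug cdi generate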

I would have to look at how to make this more robust.


Samiser commented Apr 12, 2024

Sure, here is the debug output both with --mode wsl and without:

~
❯ nvidia-ctk --debug cdi generate --mode wsl
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

~
❯ nvidia-ctk --debug cdi generate
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Is WSL-based system? false: could not load DXCore library: libdxcore.so: cannot open shared object file: No such file or directory
DEBU[0000] Is NVML-based system? false: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
DEBU[0000] Is Tegra-based system? false: /sys/devices/soc0/family file not found
INFO[0000] Auto-detected mode as "nvml"
ERRO[0000] failed to generate CDI spec: failed to create device CDI specs: failed to initialize NVML: ERROR_LIBRARY_NOT_FOUND

Also, setting LD_PRELOAD doesn't seem to help:

~
❯ ls /usr/lib/wsl/lib/libdxcore.so
/usr/lib/wsl/lib/libdxcore.so

~
❯ LD_PRELOAD=/usr/lib/wsl/lib/libdxcore.so nvidia-ctk --debug cdi generate --mode wsl
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

Also, using the --library-search-path flag doesn't seem to help either:

~
❯ nvidia-ctk --debug cdi generate --mode wsl --library-search-path /usr/lib/wsl/lib/
DEBU[0000] Locating NVIDIA Container Toolkit CLI as nvidia-ctk
DEBU[0000] Locating "nvidia-ctk" in [/run/wrappers/bin /home/sam/.nix-profile/bin /nix/profile/bin /home/sam/.local/state/nix/profile/bin /etc/profiles/per-user/sam/bin /nix/var/nix/profiles/default/bin /run/current-system/sw/bin /home/sam/bin /usr/local/sbin /usr/local/bin /usr/sbin /usr/bin /sbin /bin]
DEBU[0000] Checking candidate '/run/current-system/sw/bin/nvidia-ctk'
DEBU[0000] Found 1 candidates; ignoring further candidates
DEBU[0000] Found nvidia-ctk candidates: [/run/current-system/sw/bin/nvidia-ctk]
DEBU[0000] Using NVIDIA Container Toolkit CLI path nvidia-ctk
DEBU[0000] Locating /dev/dxg
DEBU[0000] Locating "/dev/dxg" in [/ /dev]
DEBU[0000] Checking candidate '/dev/dxg'
DEBU[0000] Located /dev/dxg as [/dev/dxg]
INFO[0000] Selecting /dev/dxg as /dev/dxg
ERRO[0000] failed to generate CDI spec: failed to create edits common for entities: failed to create discoverer for WSL driver: failed to initialize dxcore: failed to initialize dxcore context

@loicreynier

As mentioned in NixOS/nixpkgs#312253 and nix-community/NixOS-WSL#433, you should either use the wsl.useWindowsDriver option from NixOS-WSL or use LD_LIBRARY_PATH=/usr/lib/wsl/lib when generating the CDI.
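
For reference, a minimal sketch of the first option (assuming the NixOS-WSL module is already imported, as in the config above):

{
  wsl.enable = true;
  wsl.useWindowsDriver = true; # expose the Windows-side driver libraries under /usr/lib/wsl

  # ...rest of the configuration unchanged...
}

And the second option, setting the library path only for the generation step:

❯ LD_LIBRARY_PATH=/usr/lib/wsl/lib nvidia-ctk cdi generate --mode wsl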

ereslibre added a commit to ereslibre/NixOS-WSL that referenced this issue May 20, 2024
This improves the user experience: previously, whenever the user enabled the
`config.hardware.nvidia-container-toolkit.enable` option, they could not use
their Nvidia GPUs within Docker containers because of missing libraries.

This gets fixed by setting `wsl.useWindowsDriver` explicitly when the user
requests GPU support for Docker containers.

Issue and fix provided by @qwqawawow

Related: nix-community#433
Related: NVIDIA/nvidia-container-toolkit#452