
v0.5.0 Create CDI Spec Could not find, Could not locate on COS Triage #473

Open · Dragoncell opened this issue Apr 23, 2024 · 4 comments

Dragoncell commented Apr 23, 2024

Setup:

With a custom change to the GPU Operator:
NVIDIA/gpu-operator@master...Dragoncell:gpu-operator:master-gke

The GPU Operator was installed with CDI enabled, against the COS-installed GPU driver, using the command below:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator \
  --set driver.enabled=false \
  --set cdi.enabled=true \
  --set cdi.default=true \
  --set operator.runtimeClass=nvidia-cdi \
  --set hostRoot=/ \
  --set driverRoot=/home/kubernetes/bin/nvidia \
  --set devRoot=/ \
  --set operator.repository=gcr.io/jiamingxu-gke-dev \
  --set operator.version=v0422_04 \
  --set toolkit.installDir=/home/kubernetes/bin/nvidia \
  --set toolkit.repository=gcr.io/jiamingxu-gke-dev \
  --set toolkit.version=v4 \
  --set validator.repository=gcr.io/jiamingxu-gke-dev \
  --set validator.version=v0417_1 \
  --set devicePlugin.version=v0422_4 \
  --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev

During CDI spec creation, both in the toolkit container (management CDI spec) and in the k8s device plugin (workload CDI spec), a few warning-level logs appear.

Both:

  1. Could not find ld.so.cache
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
time="2024-04-22T19:37:03Z" level=info msg="Using driver version 535.129.03"
time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
  2. Feature related stuff
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
time="2024-04-22T19:37:03Z" level=warning msg="Could not locate nvidia/535.129.03/gsp*.bin: pattern nvidia/535.129.03/gsp*.bin not found"

k8s device plugin only:

time="2024-04-22T19:37:22Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
....
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"

Wondering: are any of these warnings worth further investigation? For example, vulkan/icd.d/nvidia_icd.json actually exists on the host:

/home/kubernetes/bin/nvidia/vulkan/icd.d $ ls
nvidia_icd.json
elezar commented Apr 23, 2024

Could not find ld.so.cache

These warnings can safely be ignored. The ldcache is only consulted for interactive use cases where the driver libraries have already been loaded. I will check whether we can make the logging more useful, though.
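
For reference, a minimal sketch of the fallback behaviour described above, with hypothetical names (this is not the toolkit's actual API):

package main

import (
	"log"
	"os"
	"path/filepath"
)

// loadLDCache returns entries from ld.so.cache under the given driver root,
// or falls back to an empty cache when the file is missing (as on COS).
func loadLDCache(driverRoot string) []string {
	cachePath := filepath.Join(driverRoot, "etc", "ld.so.cache")
	if _, err := os.Stat(cachePath); err != nil {
		// Matches the warning in the logs above: the cache is optional,
		// so library discovery continues with an empty cache instead of failing.
		log.Printf("Could not find ld.so.cache at %v; creating empty cache", cachePath)
		return nil
	}
	// ... parse the binary ld.so.cache format and return its entries ...
	return nil
}

func main() {
	entries := loadLDCache("/host/home/kubernetes/bin/nvidia")
	log.Printf("using %d cached library entries", len(entries))
}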

Feature related stuff

  • IPC sockets: I assume that in your case these sockets are also rooted at / and not at /host/home/kubernetes? The detection logic would have to be updated to locate these properly.
  • gsp*.bin: we do support custom firmware paths, and I would have to check why this is not working as expected. I assume it is because we're assuming a hostDriverRoot prefix here (see the sketch after this list). It should be noted that it has been pointed out internally that injecting the firmware should not be needed, so this warning can be ignored.
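
A sketch of the suspected prefix mismatch, with illustrative names (not the toolkit's actual code): the glob only matches if it is resolved against the root the firmware actually lives under.

package main

import (
	"fmt"
	"path/filepath"
)

// findGSPFirmware globs for gsp*.bin under a firmware tree rooted at the
// given prefix. The suspected bug is that the prefix used is a
// hostDriverRoot-style path rather than the COS driverRoot.
func findGSPFirmware(prefix, version string) ([]string, error) {
	pattern := filepath.Join(prefix, "firmware", "nvidia", version, "gsp*.bin")
	return filepath.Glob(pattern)
}

func main() {
	// On COS the files live under the driverRoot (see the follow-up comment):
	matches, err := findGSPFirmware("/home/kubernetes/bin/nvidia", "535.129.03")
	fmt.Println(matches, err) // expect gsp_ga10x.bin and gsp_tu10x.bin
}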

k8s device plugin only

For the graphics libraries, we would have to confirm that these are actually present in the driver installation rooted at /host/home/kubernetes. For nvidia_icd.json specifically, we search /etc, /usr/local/share, and /usr/share by default for vulkan/icd.d/nvidia_icd.json. This means we are missing the /home/kubernetes/bin/nvidia/vulkan/icd.d/nvidia_icd.json file (the default search is sketched below). Could you provide a list of the files installed by your driver container?
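
To make the miss concrete, a sketch of the default search just described, using the roots from this comment (function names are illustrative):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultSearchRoots are the roots searched for vulkan/icd.d/nvidia_icd.json
// by default, per the comment above. The COS driver root
// /home/kubernetes/bin/nvidia is not among them, so its copy of the file is missed.
var defaultSearchRoots = []string{"/etc", "/usr/local/share", "/usr/share"}

func locateICD(rel string) (string, bool) {
	for _, root := range defaultSearchRoots {
		candidate := filepath.Join(root, rel)
		if _, err := os.Stat(candidate); err == nil {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	if path, ok := locateICD("vulkan/icd.d/nvidia_icd.json"); ok {
		fmt.Println("found:", path)
	} else {
		fmt.Println("not found in default roots") // produces the warning seen above
	}
}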

elezar self-assigned this Apr 23, 2024
Dragoncell commented Apr 23, 2024

  • IPC sockets: COS does not run the fabricmanager or persistenced daemons automatically, so this is okay.
  • gsp*.bin: it is present on the host at the path below:
    /home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03 $ ls
    gsp_ga10x.bin  gsp_tu10x.bin
    

Here is the list of files installed under /home/kubernetes/bin/nvidia; it seems COS only installs vulkan for now:

 /home/kubernetes/bin/nvidia $ ls
NVIDIA-Linux-x86_64-535.129.03.run  bin-workdir  drivers-workdir  lib64          nvidia-drivers-535.129.03.tgz  share    vulkan
bin                                 drivers      firmware         lib64-workdir  nvidia-installer.log           toolkit

So from this triage, we have two action items:

  1. [Optional] gsp*.bin: investigate why this is not working as expected
  2. Add a driverRoot path search for vulkan/icd.d/nvidia_icd.json (a possible shape is sketched below)
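
For item 2, one possible shape of the fix, assuming the search roots can simply be extended with the configured driverRoot (illustrative only, not the change that eventually landed):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// locateICDWithDriverRoot extends the default search roots with the
// configured driverRoot, so on COS the file at
// /home/kubernetes/bin/nvidia/vulkan/icd.d/nvidia_icd.json would be found.
func locateICDWithDriverRoot(driverRoot, rel string) (string, bool) {
	roots := []string{"/etc", "/usr/local/share", "/usr/share", driverRoot}
	for _, root := range roots {
		candidate := filepath.Join(root, rel)
		if _, err := os.Stat(candidate); err == nil {
			return candidate, true
		}
	}
	return "", false
}

func main() {
	path, ok := locateICDWithDriverRoot("/home/kubernetes/bin/nvidia", "vulkan/icd.d/nvidia_icd.json")
	fmt.Println(path, ok)
}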

elezar commented Apr 24, 2024

The gsp firmware warnings could be addressed by #317 (although it would have to be rebased and reworked).

elezar commented Jun 6, 2024

@Dragoncell I have created #529 with a proposed change to locate the Vulkan ICD files as available in the GKE driver.
