GPU integration with nanos #1621

Open
zeroecco opened this issue May 13, 2024 · 15 comments

@zeroecco (Contributor)

Hello!

I am attempting to get a working unikernel on my workstation (following this blog: https://nanovms.com/dev/tutorials/gpu-accelerated-computing-nanos-unikernels) but am running into a number of hurdles I thought I should document and ask for assistance with:

  • First thing I came across: I cannot build the klib on main. I worked around this by checking out nanos 0.1.50.

  • Second thing: the gpu repo was updated to support nvidia driver version 535, but it ships two bin files (gsp_ga10x.bin and gsp_tu10x.bin); I copied both, but I'm not sure that was the right choice.

  • Third thing: following the guide, the ops config is wrong for the current code. "Klibs": ["gpu_nvidia"] needs to be outside the RunConfig section per the ops docs (ops also complained that the config was wrong).

  • Fourth thing, and where I am currently stuck: ops bombs out immediately saying invalid GPU type. I'm not sure where to look from here to figure out what I am doing wrong. Any debugging steps I should take?

Here is the current output:

ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
Invalid GPU type 'nvidia-tesla-t4'
cat ops.config
{
  "RunConfig": {
    "GPUs": 1,
    "GPUType": "nvidia-tesla-t4"
  },
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"]
}
@eyberg (Contributor) commented May 13, 2024

that article was written a while ago

are you trying to run this locally or in the cloud? if local, there is additional work one has to do to use it locally: #1528. the older article you linked was specifically for GCP (we have an outstanding task to document the on-prem setup: nanovms/ops-documentation#430)

@francescolavra (Member)

* First thing I came across: I cannot build the klib on main. I hurdled this by checking out 0.1.50 of nanos

There has indeed been a recent change (nanovms/nanos#2011) to the nanos interrupt API, and the nvidia klib hasn't been updated yet to adapt to it. If you want, you can check out the kernel version prior to that PR and build the klib against that. Also, please note that in order to build the klib you have to build nanos itself first.
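
A rough sketch of that workflow (treat it as an outline: the checkout target and the exact klib make invocation depend on your setup and on the gpu-nvidia Makefile):

git clone https://github.com/nanovms/nanos && cd nanos
git checkout 0.1.50   # any tree prior to the interrupt API change in nanovms/nanos#2011
make                  # build nanos itself first
# then build the gpu_nvidia klib from the nanovms/gpu-nvidia repo against this kernel tree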

* Second thing: The gpu repo was updated to support nvidia driver version 535, but there are two bin files (gsp_ga10x.bin, and gsp_tu10x.bin), I copied both but not sure if that was the right choice.

Copying both is fine; the driver will pick the right one depending on which GPU type it detects.

* Fourth thing and where I am currently stuck: ops bombs out immediately saying invalid GPU type. Not sure where to look from here on what I am doing wrong. Any debugging steps I should take from here?

The only "GPUType" you can set in the config when running locally is "pci-passthrough" (but you can just omit the "GPUType" option altogether, since pci-passthrough is the default setting). This will detect the GPU(s) connected to the PCI bus of your machine, and should work with any supported Nvidia GPU type.

@rinor (Contributor) commented May 14, 2024

last time I checked the build, the only change I made was:

diff --git a/kernel-open/nvidia/nv-msi.c b/kernel-open/nvidia/nv-msi.c
index 020ef53..a0c2be9 100644
--- a/kernel-open/nvidia/nv-msi.c
+++ b/kernel-open/nvidia/nv-msi.c
@@ -55,7 +55,8 @@ void NV_API_CALL nv_init_msi(nv_state_t *nv)
         }
         else
         {
-            msi_format(&address, &data, nv->interrupt_line);
+            u32 target_cpu = irq_get_target_cpu(irange(0, 0));
+            msi_format(&address, &data, nv->interrupt_line, target_cpu);
             pci_cfgwrite(dev, cp + 4, 4, address);    /* address low */
             pci_cfgwrite(dev, cp + 8, 4, 0);          /* address high */
             pci_cfgwrite(dev, cp + 12, 4, data);      /* data */

can't confirm that it is correct, just that it builds fine with the latest nanos.

@francescolavra (Member)

Yes, that is a correct change. Thanks

@zeroecco (Contributor, Author)

thanks for all this feedback! I will try it and let you know ASAP

@francescolavra (Member)

nanovms/gpu-nvidia#5 has been merged in our gpu-nvidia repository, so the klib now builds successfully against the master branch of nanos.

@zeroecco (Contributor, Author)

closer:

root@north:~/r0uk# ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@north)  Mon May 13 02:04:21 AM UTC 2024
Loaded the UVM driver, major device number 0.
2024/05/15 17:50:16 Listening...on 8080
en1: assigned FE80::30A6:AEFF:FE3E:B03D
^Cqemu-system-x86_64: terminating on signal 2
signal: killed
root@north:~/r0uk#

@francescolavra (Member)

As written in the tutorial, the line "Loaded the UVM driver, major device number 0" indicates that the GPU klib was loaded successfully, and the GPU attached to your instance is available for your application to use. Are you facing any issues?
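
If you want a programmatic check beyond the boot log, a minimal CUDA runtime snippet along these lines should report at least one device (a sketch; the deviceQuery sample from cuda-samples does a fuller version of the same check):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    /* ask the CUDA runtime how many devices it can see */
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount returned %d -> %s\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("found %d CUDA device(s)\n", count);
    return 0;
}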

@zeroecco (Contributor, Author)

not anymore on the nightly, thanks for your guidance

@0x5459 commented Jun 3, 2024

I am getting the following error (GeForce RTX 3080):

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: failed to register character device.
klib automatic load failed (4)

@francescolavra (Member)

The above error means the klib failed to create the /dev/nvidiactl file, which is used by the userspace nvidia drivers to interface with the GPU. @0x5459, is there anything already at that path in the image you are using?
How are you starting the Nanos instance? If you are using Ops, can you share your command line and your json configuration file?

@0x5459 commented Jun 4, 2024

I suspect that an inconsistency between my CUDA version and driver version is causing the issue. My program is compiled with CUDA 11; now I am trying to install CUDA 12. I will reply here with any updates.

My config:

{
  "RebootOnExit": true,
  "ManifestPassthrough": {
    "readonly_rootfs": "true"
  },
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug",
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}

@0x5459 commented Jun 4, 2024

I have tried compiling my program with CUDA 12.2 but still get the same error. Could you give me some help? @francescolavra

@francescolavra (Member) commented Jun 4, 2024

The problem is that your root filesystem is configured as read-only (via the "readonly_rootfs": "true" option in your config). This prevents the klib from creating the /dev/nvidiactl file, which causes the "failed to register character device" error.
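
A sketch of the fix: the same config as above with the "readonly_rootfs" option removed (everything else unchanged):

{
  "RebootOnExit": true,
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug"
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}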

@0x5459 commented Jun 4, 2024

@francescolavra Hi, I have a new issue.

I built the deviceQuery program from cuda-samples using the following config:

{
  "Program": "deviceQuery",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "RunConfig": {
    "GPUs": 1
  }
}

But I got the error below:

$ ops instance logs test

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@ipfs)  Tue Jun  4 01:42:09 PM CST 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 304
-> OS call failed or operation not supported on this OS
Result = FAIL
$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Could you please provide guidance on how to resolve this issue?
Thank you very much for your time and help.
