GPU integration with nanos #1621

Open
zeroecco opened this issue May 13, 2024 · 15 comments

@zeroecco (Contributor)

Hello!

I am attempting to get a working unikernel on my workstation (following this blog: https://nanovms.com/dev/tutorials/gpu-accelerated-computing-nanos-unikernels) but am running into a number of hurdles I thought I should document and ask for assistance with:

  • First thing I came across: I cannot build the klib on main. I worked around this by checking out nanos 0.1.50.

  • Second thing: the gpu repo was updated to support nvidia driver version 535, but it ships two bin files (gsp_ga10x.bin and gsp_tu10x.bin); I copied both, but I'm not sure that was the right choice.

  • Third thing: following the guide, the ops config is wrong for the current code. "Klibs": ["gpu_nvidia"] needs to be outside the RunConfig section per the ops docs (ops also complained that the config was wrong).

  • Fourth thing, and where I am currently stuck: ops bombs out immediately saying invalid GPU type. I'm not sure where to look from here to figure out what I am doing wrong. Any debugging steps I should take?

Here is the current output:

ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
Invalid GPU type 'nvidia-tesla-t4'
cat ops.config
{
  "RunConfig": {
    "GPUs": 1,
    "GPUType": "nvidia-tesla-t4"
  },
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"]
}
@eyberg (Contributor) commented May 13, 2024

that article was written a while ago

are you trying to run this locally or in the cloud? if local, there is additional work one has to do to use it locally: #1528. the older article you linked was specifically for GCP (we have an outstanding task to document the on-prem setup: nanovms/ops-documentation#430)

@francescolavra (Member)

* First thing I came across: I cannot build the klib on main. I hurdled this by checking out 0.1.50 of nanos

There has indeed been a recent change (nanovms/nanos#2011) to the nanos interrupt API, and the nvidia klib hasn't been updated yet to adapt to it. If you want, you can check out the kernel version prior to that PR and build the klib against that. Also, please note that in order to build the klib you have to build nanos itself first.
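
A rough sketch of that workflow (treat it as an outline: the checkout target and the exact klib make invocation depend on your setup and on the gpu-nvidia Makefile):

git clone https://github.com/nanovms/nanos && cd nanos
git checkout 0.1.50   # any tree prior to the interrupt API change in nanovms/nanos#2011
make                  # build nanos itself first
# then build the gpu_nvidia klib from the nanovms/gpu-nvidia repo against this kernel tree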

* Second thing: The gpu repo was updated to support nvidia driver version 535, but there are two bin files (gsp_ga10x.bin, and gsp_tu10x.bin), I copied both but not sure if that was the right choice.

Copying both is fine; the driver will pick the right one depending on which GPU type it detects.

* Fourth thing and where I am currently stuck: ops bombs out immediately saying invalid GPU type. Not sure where to look from here on what I am doing wrong. Any debugging steps I should take from here?

The only "GPUType" you can set in the config when running locally is "pci-passthrough" (but you can just omit the "GPUType" option altogether, since pci-passthrough is the default setting). This will detect the GPU(s) connected to the PCI bus of your machine, and should work with any supported Nvidia GPU type.

@rinor (Contributor) commented May 14, 2024

last time I checked the build, the only change I made was:

diff --git a/kernel-open/nvidia/nv-msi.c b/kernel-open/nvidia/nv-msi.c
index 020ef53..a0c2be9 100644
--- a/kernel-open/nvidia/nv-msi.c
+++ b/kernel-open/nvidia/nv-msi.c
@@ -55,7 +55,8 @@ void NV_API_CALL nv_init_msi(nv_state_t *nv)
         }
         else
         {
-            msi_format(&address, &data, nv->interrupt_line);
+            u32 target_cpu = irq_get_target_cpu(irange(0, 0));
+            msi_format(&address, &data, nv->interrupt_line, target_cpu);
             pci_cfgwrite(dev, cp + 4, 4, address);    /* address low */
             pci_cfgwrite(dev, cp + 8, 4, 0);          /* address high */
             pci_cfgwrite(dev, cp + 12, 4, data);      /* data */

can't confirm that it is correct, just that it builds fine with the latest nanos.

@francescolavra (Member)

Yes, that is a correct change. Thanks

@zeroecco (Contributor, Author)

thanks for all this feedback! I will try it and let you know ASAP

@francescolavra (Member)

nanovms/gpu-nvidia#5 has been merged in our gpu-nvidia repository, so the klib now builds successfully against the master branch of nanos.

@zeroecco (Contributor, Author)

closer:

root@north:~/r0uk# ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@north)  Mon May 13 02:04:21 AM UTC 2024
Loaded the UVM driver, major device number 0.
2024/05/15 17:50:16 Listening...on 8080
en1: assigned FE80::30A6:AEFF:FE3E:B03D
^Cqemu-system-x86_64: terminating on signal 2
signal: killed
root@north:~/r0uk#

@francescolavra (Member)

As written in the tutorial, the line "Loaded the UVM driver, major device number 0" indicates that the GPU klib was loaded successfully, and the GPU attached to your instance is available for your application to use. Are you facing any issues?
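
If you want a programmatic check beyond the boot log, a minimal CUDA runtime snippet along these lines should report at least one device (a sketch; the deviceQuery sample from cuda-samples does a fuller version of the same check):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    /* ask the CUDA runtime how many devices it can see */
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount returned %d -> %s\n", (int)err, cudaGetErrorString(err));
        return 1;
    }
    printf("found %d CUDA device(s)\n", count);
    return 0;
}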

@zeroecco (Contributor, Author)

not anymore on the nightly, thanks for your guidance

@0x5459 commented Jun 3, 2024

I am getting the following error (GeForce RTX 3080):

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: failed to register character device.
klib automatic load failed (4)

@francescolavra (Member)

The above error means the klib failed to create the /dev/nvidiactl file, which is used by the userspace nvidia drivers to interface with the GPU. @0x5459, is there anything already at that path in the image you are using?
How are you starting the Nanos instance? If you are using Ops, can you share your command line and your json configuration file?

@0x5459 commented Jun 4, 2024

I suspect that an inconsistency between my CUDA version and driver version is causing the issue. My program is compiled with CUDA 11; now I am trying to install CUDA 12. I will reply here with any updates.

My config:

{
  "RebootOnExit": true,
  "ManifestPassthrough": {
    "readonly_rootfs": "true"
  },
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug",
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}

@0x5459 commented Jun 4, 2024

I have tried compiling my program with CUDA 12.2 but still get the same error. Could you give me some help? @francescolavra

@francescolavra (Member) commented Jun 4, 2024

The problem is that your root filesystem is configured as read-only (via the "readonly_rootfs": "true" option in your config). This prevents the klib from creating the /dev/nvidiactl file, which causes the "failed to register character device" error.
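
A sketch of the fix: the same config as above with the "readonly_rootfs" option removed (everything else unchanged):

{
  "RebootOnExit": true,
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug"
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}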

@0x5459 commented Jun 4, 2024

@francescolavra Hi, I have a new issue.

I built the deviceQuery program from cuda-samples using the following config:

{
  "Program": "deviceQuery",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "RunConfig": {
    "GPUs": 1
  }
}

But I got the error below:

$ ops instance logs test

en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.113.01  Release Build  (root@ipfs)  Tue Jun  4 01:42:09 PM CST 2024
Loaded the UVM driver, major device number 0.
deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 304
-> OS call failed or operation not supported on this OS
Result = FAIL
$ nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Could you please provide guidance on how to resolve this issue?
Thank you very much for your time and help.
