GPU integration with nanos #1621
Comments
That article was written a while ago. Are you trying to run this locally or in the cloud? If local, there is additional work needed to use it locally: #1528. The older article you linked was for GCP specifically (we have an outstanding task to document the on-prem setup: nanovms/ops-documentation#430). |
There has indeed been a recent change (in nanovms/nanos#2011) to the nanos interrupt API, and the nvidia klib hasn't been updated yet to adapt to it. If you want, you can check out the kernel version prior to that PR and build the klib against that. Also note that in order to build the klib, you have to build nanos itself first.
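The steps above might look roughly like this (a sketch, not verified: the placeholder commit and the klib makefile variable are assumptions, not from this thread; check each repo's README for the real build invocation):

```
# build nanos first, at a commit prior to nanovms/nanos#2011
git clone https://github.com/nanovms/nanos && cd nanos
git checkout <commit-before-PR-2011>   # placeholder: pick the parent of the PR merge
make

# then build the nvidia klib against that tree
cd ../gpu-nvidia/kernel-open
make NANOS_DIR=../../nanos             # variable name is a guess; see the repo README
```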
Copying both is fine; the driver will pick the right one depending on which GPU type it detects.
The only "GPUType" you can set in the config when running locally is "pci-passthrough" (but you can omit the "GPUType" option altogether, since pci-passthrough is the default). This will detect the GPU(s) connected to the PCI bus of your machine, and should work with any supported Nvidia GPU type. |
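For example, a minimal local config might look like this (a sketch; the placement of "GPUType" alongside "GPUs" inside "RunConfig" is an assumption based on the configs shown later in this thread):

```json
{
  "Klibs": ["gpu_nvidia"],
  "RunConfig": {
    "GPUs": 1,
    "GPUType": "pci-passthrough"
  }
}
```

As noted above, "GPUType" can be omitted entirely, since pci-passthrough is the default.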
Last time I checked the build, the only change I made was:

```diff
diff --git a/kernel-open/nvidia/nv-msi.c b/kernel-open/nvidia/nv-msi.c
index 020ef53..a0c2be9 100644
--- a/kernel-open/nvidia/nv-msi.c
+++ b/kernel-open/nvidia/nv-msi.c
@@ -55,7 +55,8 @@ void NV_API_CALL nv_init_msi(nv_state_t *nv)
     }
     else
     {
-        msi_format(&address, &data, nv->interrupt_line);
+        u32 target_cpu = irq_get_target_cpu(irange(0, 0));
+        msi_format(&address, &data, nv->interrupt_line, target_cpu);
         pci_cfgwrite(dev, cp + 4, 4, address);  /* address low */
         pci_cfgwrite(dev, cp + 8, 4, 0);        /* address high */
         pci_cfgwrite(dev, cp + 12, 4, data);    /* data */
```

I can't confirm that it is correct, just that it builds fine with the latest nanos. |
Yes, that is a correct change. Thanks |
thanks for all this feedback! I will try it and let you know ASAP |
nanovms/gpu-nvidia#5 has been merged in our gpu-nvidia repository, so the klib now builds successfully against the master branch of nanos. |
Closer:

```
root@north:~/r0uk# ops run -c ops.config main
running local instance
booting /root/.ops/images/main ...
en1: assigned 10.0.2.15
NVRM _sysCreateOs: RM Access Sys Cap creation failed: 0x56
NVRM cpuidInfoAMD: Unrecognized AMD processor in cpuidInfoAMD
NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 535.113.01 Release Build (root@north) Mon May 13 02:04:21 AM UTC 2024
Loaded the UVM driver, major device number 0.
2024/05/15 17:50:16 Listening...on 8080
en1: assigned FE80::30A6:AEFF:FE3E:B03D
^Cqemu-system-x86_64: terminating on signal 2
signal: killed
root@north:~/r0uk#
``` |
As written in the tutorial, the line "Loaded the UVM driver, major device number 0" indicates that the GPU klib was loaded successfully, and the GPU attached to your instance is available for your application to use. Are you facing any issues? |
not anymore on the nightly, thanks for your guidance |
I am getting the following error (GeForce RTX 3080):
|
The above error means the klib failed to create the /dev/nvidiactl file which is used by the userspace nvidia drivers to interface with the GPU. @0x5459 is there anything already at that path in the image you are using? |
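One way to answer that question from a shell (a generic sketch, not ops-specific; run it inside the image's root filesystem or against the extracted image contents) is to test whether anything already occupies that path:

```shell
#!/bin/sh
# Check whether something already exists at /dev/nvidiactl;
# the klib needs to create this device node itself, so a
# pre-existing file (or a read-only filesystem) would block it.
if [ -e /dev/nvidiactl ]; then
    echo "present"
else
    echo "absent"
fi
```

On a machine without the nvidia userspace stack installed this should print "absent"; inside an image where something was baked in at that path, it prints "present".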
I suspect that the inconsistency between my CUDA version and driver version is causing the issue. My program is compiled with CUDA 11. Now I am trying to install CUDA 12. I will reply here with any updates. My config:

```json
{
  "RebootOnExit": true,
  "ManifestPassthrough": {
    "readonly_rootfs": "true"
  },
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug"
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}
``` |
I have tried to compile my program with CUDA 12.2 but still get the same error. Could you give me some help? @francescolavra |
The problem is in the fact that your root filesystem is being configured as read-only (via the "readonly_rootfs" option in "ManifestPassthrough"), which prevents the klib from creating the /dev/nvidiactl file. |
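Assuming the fix is simply to drop the read-only setting, the config shown earlier in the thread would become something like this sketch (the "ManifestPassthrough" block with "readonly_rootfs" removed, everything else unchanged):

```json
{
  "RebootOnExit": true,
  "Env": {
    "RUST_BACKTRACE": "1",
    "RUST_LOG": "debug"
  },
  "Program": "c2-test",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "Mounts": {
    "/root/dataset": "/dataset"
  },
  "RunConfig": {
    "CPUs": 32,
    "Memory": "64g",
    "GPUs": 1
  }
}
```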
@francescolavra Hi, I have a new issue. I built the deviceQuery program from cuda-samples using the following config:

```json
{
  "Program": "deviceQuery",
  "KlibDir": "/root/code/gpu-nvidia/kernel-open/_out/Nanos_x86_64",
  "Klibs": ["gpu_nvidia"],
  "Dirs": ["nvidia"],
  "RunConfig": {
    "GPUs": 1
  }
}
```

But I got the error below:
Could you please provide guidance on how to resolve this issue? |
Hello!
I am attempting to get a working unikernel on my workstation (going through this blog: https://nanovms.com/dev/tutorials/gpu-accelerated-computing-nanos-unikernels) but running into a number of hurdles I thought I should document and ask for assistance on:
First thing I came across: I cannot build the klib on main. I worked around this by checking out nanos 0.1.50.
Second thing: the gpu repo was updated to support nvidia driver version 535, but there are two bin files (gsp_ga10x.bin and gsp_tu10x.bin); I copied both, but I'm not sure that was the right choice.
Third thing: following the guide, the ops config is wrong for the current code.
"Klibs": ["gpu_nvidia"],
needs to be outside of RunConfig based on the ops docs (ops also complained about the config being wrong).
Fourth thing, and where I am currently stuck: ops bombs out immediately saying invalid GPU type. I'm not sure where to look from here on what I am doing wrong. Any debugging steps I should take?
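For reference, the corrected placement described above would look something like this (a sketch based on the configs posted later in this thread, with "Klibs" at the top level and only "GPUs" inside "RunConfig"):

```json
{
  "Program": "my-program",
  "Klibs": ["gpu_nvidia"],
  "RunConfig": {
    "GPUs": 1
  }
}
```

"Program" here is a placeholder name, not from the original post.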
Here is the current output: