
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). #8269

Open

jiaqizhang123-stack opened this issue Jan 31, 2024 · 42 comments
Labels: triage (Please triage and relabel this issue)

@jiaqizhang123-stack

Hello, when I use PyInstaller to package my training code (torch, CUDA, Linux), CUDA cannot be used on another computer. Is this because CUDA wasn't packaged?
The error is:
[screenshot of the error]
Thank you for your help.

jiaqizhang123-stack added the triage label on Jan 31, 2024
@rokm (Member) commented Jan 31, 2024

Probably. Try updating pyinstaller-hooks-contrib to the latest version (which has pyinstaller/pyinstaller-hooks-contrib#676) and see if that gets you any further.
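(To double-check which versions are actually present in the build environment, here is a minimal sketch using only the standard library; the names are the PyPI distribution names:)

# Sketch: print the versions of the packages relevant to this issue.
from importlib.metadata import version, PackageNotFoundError

for dist in ("pyinstaller", "pyinstaller-hooks-contrib", "torch"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")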

@jiaqizhang123-stack (Author)

[screenshot: installed package versions]
The PyInstaller I installed is already the latest version.

@rokm (Member) commented Jan 31, 2024

How did you install torch? pip or conda?

@jiaqizhang123-stack (Author)

I downloaded the .whl file and installed it using pip.

@rokm (Member) commented Jan 31, 2024

Hmm, if you freeze the following example

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

does it work? Or does torch.cuda.is_available() also return False?

Does the target computer have the NVIDIA driver installed?

@rokm (Member) commented Jan 31, 2024

That 'Unexpected error from cudaGetDeviceCount' together with 'forward compatibility was attempted on non-supported HW' likely implies a driver issue. Does CUDA otherwise work on the target machine? Does nvidia-smi show any errors? See pytorch/pytorch#40671

@jiaqizhang123-stack (Author)

[screenshot: nvidia-smi output]
nvidia-smi does not show any errors.

@jiaqizhang123-stack (Author)

Hmm, if you freeze the following example

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

does it work? Or does torch.cuda.is_available() also return False?

Does the target computer have the NVIDIA driver installed?

When I package using different versions of CUDA and torch, the CUDA device counts obtained on the target computer are inconsistent. The graphics card of the target computer is a GTX 1650.
CUDA 11.7:
[screenshot: device count output with CUDA 11.7]
CUDA 10.2:
[screenshot: device count output with CUDA 10.2]

@rokm (Member) commented Feb 1, 2024

FWIW, if I freeze

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

in a clean Python 3.9.18 virtual environment on my Fedora 39 desktop (with torch installed via pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117), and transfer the program to my Fedora 39 notebook, it works as expected - torch.cuda.is_available() returns True and torch.cuda.device_count() returns 1. Both systems are running driver 545.29.06, though (and neither system has a system-installed CUDA toolkit - just the driver and its CUDA run-time components).


If you set up a Python virtual environment on the target system and install torch in there, does running torch.cuda.is_available() in a Python interactive prompt work as expected?

@jiaqizhang123-stack (Author)

When the driver versions of the two computers are the same, torch.cuda.is_available() returns True. Why does the driver version affect the use of CUDA? Is this related to an incomplete set of libraries being included during packaging?
CUDA 11.7, torch 1.13.1.
The loss function results in a NaN value on the GTX 1650 with the packaged exe; is it because the GTX 1650 is not compatible with CUDA 11.7?

@rokm (Member) commented Feb 2, 2024

When the driver versions of the two computers are the same, torch.cuda.is_available() returns True. Why does the driver version affect the use of CUDA? Is this related to an incomplete set of libraries being included during packaging? CUDA 11.7, torch 1.13.1. The loss function results in a NaN value on the GTX 1650 with the packaged exe; is it because the GTX 1650 is not compatible with CUDA 11.7?

Ah - in that case, the problem is likely that we collect part of the driver libraries that we shouldn't be collecting.

@rokm (Member) commented Feb 2, 2024

If you are using onedir mode, can you try removing libcuda.so.1 from the frozen application directory?

Alternatively, you can add r'libcuda\.so(\..*)?' to this list in your copy of PyInstaller (and rebuild the application with the --clean option):

_unix_excludes = {
    r'libc\.so(\..*)?',
    r'libdl\.so(\..*)?',
    r'libm\.so(\..*)?',
    r'libpthread\.so(\..*)?',
    r'librt\.so(\..*)?',
    r'libthread_db\.so(\..*)?',
    # glibc regex excludes.
    r'ld-linux\.so(\..*)?',
    r'libBrokenLocale\.so(\..*)?',
    r'libanl\.so(\..*)?',
    r'libcidn\.so(\..*)?',
    r'libcrypt\.so(\..*)?',
    r'libnsl\.so(\..*)?',
    r'libnss_compat.*\.so(\..*)?',
    r'libnss_dns.*\.so(\..*)?',
    r'libnss_files.*\.so(\..*)?',
    r'libnss_hesiod.*\.so(\..*)?',
    r'libnss_nis.*\.so(\..*)?',
    r'libnss_nisplus.*\.so(\..*)?',
    r'libresolv\.so(\..*)?',
    r'libutil\.so(\..*)?',
    # graphical interface libraries come with graphical stack (see libglvnd)
    r'libE?(Open)?GLX?(ESv1_CM|ESv2)?(dispatch)?\.so(\..*)?',
    r'libdrm\.so(\..*)?',
    # a subset of libraries included as part of the Nvidia Linux Graphics Driver as of 520.56.06:
    # https://download.nvidia.com/XFree86/Linux-x86_64/520.56.06/README/installedcomponents.html
    r'nvidia_drv\.so',
    r'libglxserver_nvidia\.so(\..*)?',
    r'libnvidia-egl-(gbm|wayland)\.so(\..*)?',
    r'libnvidia-(cfg|compiler|e?glcore|glsi|glvkspirv|rtcore|allocator|tls|ml)\.so(\..*)?',
    r'lib(EGL|GLX)_nvidia\.so(\..*)?',
    # libxcb-dri changes ABI frequently (e.g.: between Ubuntu LTS releases) and is usually installed as dependency of
    # the graphics stack anyway. No need to bundle it.
    r'libxcb\.so(\..*)?',
    r'libxcb-dri.*\.so(\..*)?',
}

Does that fix the problem?
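For orientation, the edited list would look roughly like this (a sketch, not a full patch; in recent PyInstaller versions the list typically lives in PyInstaller/depend/dylib.py, but the exact location may differ between releases):

_unix_excludes = {
    r'libcuda\.so(\..*)?',  # added entry: the driver's libcuda must come from the target system
    r'libc\.so(\..*)?',
    r'libdl\.so(\..*)?',
    # ... remaining original entries unchanged ...
}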

@jiaqizhang123-stack (Author)

[screenshot of the error]
When I test on an RTX 4090, I have this problem again. Checking the files inside the onedir build, there is no libcudnn_ops_infer.so.8 library in it. On the GTX 1650 it is able to run, but the RTX 4090 reports an error.

When I use onedir mode, removing libcuda.so.1 is OK; onefile mode is not tested yet.

@rokm (Member) commented Feb 3, 2024

If this is still torch 1.13.1+cu117, then originally, the libcudnn_ops_infer.so.8 is in site-packages/torch/lib/libcudnn_ops_infer.so.8 - and thus should be collected to torch/lib/libcudnn_ops_infer.so.8.

Can you check the contents of the torch/lib directory in the onedir frozen application, and paste the list of files here?

@jiaqizhang123-stack (Author)

[screenshot: contents of torch/lib in the frozen application]

@rokm (Member) commented Feb 3, 2024

Hmmm... in that case, can you rebuild again (with the --clean option), and watch the build log for a line like

...
INFO: Loading module hook 'hook-torch.py' from '/path/to/hook'...
...

to see where the torch hook is loaded from; then open that file, and paste its contents here.

@jiaqizhang123-stack (Author)

[screenshot: build log line showing where hook-torch.py is loaded from]

@rokm (Member) commented Feb 3, 2024

What are contents of the hook-torch.py file in that directory?

@jiaqizhang123-stack (Author)

[screenshot: contents of hook-torch.py]

@rokm (Member) commented Feb 3, 2024

That's not from pyinstaller-hooks-contrib 2024.0 - which should look like this.

Your hook file seems to be from 2023.10 or earlier.


Although, since it is an old brute-force hook that collects the whole torch directory, it should collect the cuDNN libs from torch/lib as well (2023.11 would likely fail to collect versioned .so.8 files, while 2023.12 and 2024.0 should be OK again).

So the next question is: what are the contents of the site-packages/torch/lib directory? Are the libcudnn_*.so.8 files there in the first place?

(If you were not rebuilding with --clean before, did the clean build perhaps collect the cuDNN files into the frozen application?)
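(A quick way to check this without hunting for the site-packages path; a small sketch to run inside the build environment:)

# Sketch: list the shared libraries that ship inside the installed torch package.
import os
import torch

torch_lib = os.path.join(os.path.dirname(torch.__file__), 'lib')
for name in sorted(os.listdir(torch_lib)):
    print(name)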

@jiaqizhang123-stack (Author)

[screenshot: contents of site-packages/torch/lib]
The libcudnn_*.so.8 files are not there in the first place.

@rokm (Member) commented Feb 3, 2024

Is this conda-installed torch, then? At the beginning, you said you downloaded .whl files - from where?

@jiaqizhang123-stack (Author)

I installed it from a .whl file using pip, not with conda.

@rokm (Member) commented Feb 3, 2024

Where did you get the .whl file from?

@jiaqizhang123-stack (Author)

Sorry, I think I installed this torch with conda when I created the environment. I have now updated pyinstaller-hooks-contrib; the libcudnn_*.so.8 files are still not there in the first place.
[screenshot]
[screenshot]

@rokm (Member) commented Feb 3, 2024

The hook does not really support conda-installed torch. Because in that case, the CUDA and cuDNN libs are not part of the torch package, and would need to be collected from the conda environment. But we can automatically collect only the ones that are link-time dependencies (which are picked up by our binary dependency analysis), while the ones that are dynamically loaded at run-time (like the cuDNN libs) are not automatically collected. You'd need to manually ensure they are collected (for example, via --add-binary) or manually copy them into the (onedir) frozen application's top-level directory.
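As an illustration of the manual-collection route (a sketch only; the conda environment path and the glob pattern are assumptions, adjust them to your setup), the extra binaries can be passed to Analysis in the .spec file, which has the same effect as repeated --add-binary options:

# .spec fragment; Analysis is provided by PyInstaller when it processes the spec file.
import glob

conda_lib = '/opt/conda/envs/myenv/lib'   # assumed location of the conda environment's lib directory
extra_cudnn = [(f, '.') for f in glob.glob(conda_lib + '/libcudnn*.so*')]

a = Analysis(
    ['train.py'],
    binaries=extra_cudnn,   # same effect as --add-binary "<src>:." on the command line
    # ... other Analysis arguments as generated by pyi-makespec ...
)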

@rokm (Member) commented Feb 3, 2024

Hmm, actually, it looks like the cuDNN libs are part of the torch conda package (installed as conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia as per https://pytorch.org/get-started/previous-versions/):

$ ls -l /home/rok/miniconda3/envs/pyi-torch/lib/python3.9/site-packages/torch/lib
total 2268760
-rwxr-xr-x. 2 rok rok    415264 Dec  8  2022 libc10_cuda.so
-rwxr-xr-x. 2 rok rok    882160 Dec  8  2022 libc10.so
-rwxr-xr-x. 2 rok rok     17544 Dec  8  2022 libcaffe2_nvrtc.so
-rwxr-xr-x. 2 rok rok 113856992 Dec  8  2022 libcudnn_adv_infer.so.8
-rwxr-xr-x. 2 rok rok  95519224 Dec  8  2022 libcudnn_adv_train.so.8
-rwxr-xr-x. 2 rok rok 458532072 Dec  8  2022 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 2 rok rok  74242352 Dec  8  2022 libcudnn_cnn_train.so.8
-rwxr-xr-x. 2 rok rok  91406344 Dec  8  2022 libcudnn_ops_infer.so.8
-rwxr-xr-x. 2 rok rok  67609952 Dec  8  2022 libcudnn_ops_train.so.8
-rwxr-xr-x. 2 rok rok    163040 Dec  8  2022 libcudnn.so.8
-rwxr-xr-x. 2 rok rok   7104344 Dec  8  2022 libcupti-6ac96871.so.11.7
-rwxr-xr-x. 2 rok rok     44656 Dec  8  2022 libshm.so
-rwxr-xr-x. 2 rok rok 306617864 Dec  8  2022 libtorch_cpu.so
-rwxr-xr-x. 2 rok rok 149299960 Dec  8  2022 libtorch_cuda_cpp.so
-rwxr-xr-x. 2 rok rok 852888760 Dec  8  2022 libtorch_cuda_cu.so
-rwxr-xr-x. 2 rok rok  82386400 Dec  8  2022 libtorch_cuda_linalg.so
-rwxr-xr-x. 2 rok rok    166632 Dec  8  2022 libtorch_cuda.so
-rwxr-xr-x. 2 rok rok     15696 Dec  8  2022 libtorch_global_deps.so
-rwxr-xr-x. 2 rok rok  21984128 Dec  8  2022 libtorch_python.so
-rwxr-xr-x. 2 rok rok     15480 Dec  8  2022 libtorch.so

and if I freeze a test torch program in that miniconda environment, they all end up collected:

ls -l dist/program/_internal/torch/lib
total 2268760
-rwxr-xr-x. 1 rok rok    415264 Feb  3 13:08 libc10_cuda.so
-rwxr-xr-x. 1 rok rok    882160 Feb  3 13:08 libc10.so
-rwxr-xr-x. 1 rok rok     17544 Feb  3 13:08 libcaffe2_nvrtc.so
-rwxr-xr-x. 1 rok rok 113856992 Feb  3 13:08 libcudnn_adv_infer.so.8
-rwxr-xr-x. 1 rok rok  95519224 Feb  3 13:08 libcudnn_adv_train.so.8
-rwxr-xr-x. 1 rok rok 458532072 Feb  3 13:08 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 1 rok rok  74242352 Feb  3 13:08 libcudnn_cnn_train.so.8
-rwxr-xr-x. 1 rok rok  91406344 Feb  3 13:08 libcudnn_ops_infer.so.8
-rwxr-xr-x. 1 rok rok  67609952 Feb  3 13:08 libcudnn_ops_train.so.8
-rwxr-xr-x. 1 rok rok    163040 Feb  3 13:08 libcudnn.so.8
-rwxr-xr-x. 1 rok rok   7104344 Feb  3 13:08 libcupti-6ac96871.so.11.7
-rwxr-xr-x. 1 rok rok     44656 Feb  3 13:08 libshm.so
-rwxr-xr-x. 1 rok rok 306617864 Feb  3 13:08 libtorch_cpu.so
-rwxr-xr-x. 1 rok rok 149299960 Feb  3 13:08 libtorch_cuda_cpp.so
-rwxr-xr-x. 1 rok rok 852888760 Feb  3 13:08 libtorch_cuda_cu.so
-rwxr-xr-x. 1 rok rok  82386400 Feb  3 13:08 libtorch_cuda_linalg.so
-rwxr-xr-x. 1 rok rok    166632 Feb  3 13:08 libtorch_cuda.so
-rwxr-xr-x. 1 rok rok     15696 Feb  3 13:08 libtorch_global_deps.so
-rwxr-xr-x. 1 rok rok  21984128 Feb  3 13:08 libtorch_python.so
-rwxr-xr-x. 1 rok rok     15480 Feb  3 13:08 libtorch.so

@jiaqizhang123-stack (Author)

Once the above problems were solved, a new problem arose:
[screenshot of the new error]

@rokm (Member) commented Feb 4, 2024

This looks like an external cuDNN (in /home/zhang/cuda/lib64) being mixed with the cuDNN that was collected into the frozen application. Is libcudnn_cnn_train.so.8 collected in _internal/torch/lib? And is there a symbolic link to that file in the top-level _internal directory?

What happens if you temporarily remove /home/zhang/cuda/lib64 from LD_LIBRARY_PATH in the target environment?
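(To answer the symlink question quickly, something like this can be run against the onedir output; the dist path is an assumption:)

# Sketch: show whether the top-level libcudnn entries are symlinks and where they point.
import os

app_dir = 'dist/train/_internal'   # adjust to your application's dist directory
for name in sorted(os.listdir(app_dir)):
    if name.startswith('libcudnn'):
        path = os.path.join(app_dir, name)
        target = os.readlink(path) if os.path.islink(path) else '(regular file, not a symlink)'
        print(name, '->', target)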

@jiaqizhang123-stack (Author)

If I temporarily remove /home/zhang/cuda/lib64 from LD_LIBRARY_PATH in the target environment, the program can run.

@jiaqizhang123-stack (Author)

[screenshot]
[screenshot]
There are also conflicts between _internal and torch/lib, and _internal's files seem to come from cuDNN.

@rokm (Member) commented Feb 4, 2024

Hmmm, yeah, those libcudnn_* files in the top-level _internal directory should be symbolic links to their counterparts in _internal/torch/lib:

ls -l dist/program/_internal | grep libcudnn
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_adv_infer.so.8 -> torch/lib/libcudnn_adv_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_cnn_infer.so.8 -> torch/lib/libcudnn_cnn_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_ops_infer.so.8 -> torch/lib/libcudnn_ops_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_ops_train.so.8 -> torch/lib/libcudnn_ops_train.so.8
lrwxrwxrwx.  1 rok rok      23 feb  3 19:53 libcudnn.so.8 -> torch/lib/libcudnn.so.8

Can you open build/<name>/Analysis-00.toc and search for all instances of, for example, libcudnn_adv_infer.so.8, and check where they were collected from?
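(If it is easier to script, something like the following does the same search; the build path is an assumption, adjust it to your spec name:)

# Sketch: print every TOC entry that mentions the given library name.
toc_path = 'build/train/Analysis-00.toc'   # adjust to build/<name>/Analysis-00.toc
with open(toc_path) as f:
    for line in f:
        if 'libcudnn_adv_infer.so.8' in line:
            print(line.rstrip())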

Do you also have an external CUDA toolkit in LD_LIBRARY_PATH on the build system (similarly to how you had it on the target system in /home/zhang/cuda/lib64)? If so, can you try temporarily removing it from LD_LIBRARY_PATH and rebuilding the program?

@jiaqizhang123-stack (Author)

I deleted this file, but I compared it with the file under cudnn and it's the same size.
I do not have an external CUDA toolkit in LD_LIBRARY_PATH on the build system. How can I avoid prioritizing calls to external environments when running on the target machine?
[screenshot]

@rokm (Member) commented Feb 4, 2024

how can I avoid prioritizing calls to external environments when running on the target machine?

You cannot, that's the problem.

I do not have an external CUDA toolkit in LD_LIBRARY_PATH on the build system

Do you have CUDA and cuDNN installed in the conda environment, then (for example, they would be installed via conda if this is the same environment that you previously had conda-installed torch in)?

@rokm (Member) commented Feb 4, 2024

how can I avoid prioritizing calls to external environments when running on the target machine?

You cannot, that's the problem.

I.e., the binary dependency analysis on Linux tries to resolve the shared lib dependencies via ldd, and that probably ends up pulling in the external environment (and only if that fails, we try the parent directory as a work-around). I plan to rework this a bit (in particular, if a binary was already discovered via hooks, that should be accounted for in the binary dependency analysis). And for this particular case, I would like to figure out what exactly is going on, so I can have a local test case to work with.

@rokm (Member) commented Feb 4, 2024

Uh, wait, just to reconfirm - are you now using conda-installed or pip-installed torch?

@jiaqizhang123-stack (Author) commented Feb 4, 2024 via email

@rokm (Member) commented Feb 4, 2024

OK, I've tried to build and run the following example program

# train_program.py
from ultralytics import YOLO

model = YOLO('yolov8n.yaml').load('yolov8n.pt')  # build from YAML and transfer weights
results = model.train(data='coco128.yaml', epochs=2, imgsz=640)

with system-installed Python 3.9.18 (i.e., no conda), a clean virtual environment, and pip-installed torch==1.13.1+cu117 and ultralytics==8.0.81. (For easier debugging, this is a onedir build.)

The libcu* files in the top-level _internal directory are symlinks:

$ ls -l dist/train_program/_internal/ | grep libcu
lrwxrwxrwx.  1 rok rok      27 feb  4 16:45 libcublasLt.so.11 -> torch/lib/libcublasLt.so.11
lrwxrwxrwx.  1 rok rok      25 feb  4 16:45 libcublas.so.11 -> torch/lib/libcublas.so.11
lrwxrwxrwx.  1 rok rok      43 feb  4 16:45 libcudart.782fcab0.so.11.0 -> torchvision.libs/libcudart.782fcab0.so.11.0
lrwxrwxrwx.  1 rok rok      36 feb  4 16:45 libcudart-e409450e.so.11.0 -> torch/lib/libcudart-e409450e.so.11.0
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_adv_infer.so.8 -> torch/lib/libcudnn_adv_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_cnn_infer.so.8 -> torch/lib/libcudnn_cnn_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_ops_infer.so.8 -> torch/lib/libcudnn_ops_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_ops_train.so.8 -> torch/lib/libcudnn_ops_train.so.8
lrwxrwxrwx.  1 rok rok      23 feb  4 16:45 libcudnn.so.8 -> torch/lib/libcudnn.so.8

to their copies in the torch/lib directory:

$ ls -l dist/train_program/_internal/torch/lib
total 3169480
-rwxr-xr-x. 1 rok rok   1230457 feb  4 16:45 libc10_cuda.so
-rwxr-xr-x. 1 rok rok    878057 feb  4 16:45 libc10.so
-rwxr-xr-x. 1 rok rok     25521 feb  4 16:45 libcaffe2_nvrtc.so
-rwxr-xr-x. 1 rok rok 348150584 feb  4 16:45 libcublasLt.so.11
-rwxr-xr-x. 1 rok rok 156720544 feb  4 16:45 libcublas.so.11
-rwxr-xr-x. 1 rok rok    687321 feb  4 16:45 libcudart-e409450e.so.11.0
-rwxr-xr-x. 1 rok rok 113856969 feb  4 16:45 libcudnn_adv_infer.so.8
-rwxr-xr-x. 1 rok rok  95519201 feb  4 16:45 libcudnn_adv_train.so.8
-rwxr-xr-x. 1 rok rok 458532049 feb  4 16:45 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 1 rok rok  74242329 feb  4 16:45 libcudnn_cnn_train.so.8
-rwxr-xr-x. 1 rok rok  91406321 feb  4 16:45 libcudnn_ops_infer.so.8
-rwxr-xr-x. 1 rok rok  67609937 feb  4 16:45 libcudnn_ops_train.so.8
-rwxr-xr-x. 1 rok rok    150200 feb  4 16:45 libcudnn.so.8
-rwxr-xr-x. 1 rok rok    168721 feb  4 16:45 libgomp-a34b3233.so.1
-rwxr-xr-x. 1 rok rok   7079489 feb  4 16:45 libnvrtc-builtins.so.11.7
-rwxr-xr-x. 1 rok rok  45791369 feb  4 16:45 libnvrtc-d833c4f3.so.11.2
-rwxr-xr-x. 1 rok rok     43681 feb  4 16:45 libnvToolsExt-847d78f2.so.1
-rwxr-xr-x. 1 rok rok     44560 feb  4 16:45 libshm.so
-rwxr-xr-x. 1 rok rok 539370033 feb  4 16:45 libtorch_cpu.so
-rwxr-xr-x. 1 rok rok 264587265 feb  4 16:45 libtorch_cuda_cpp.so
-rwxr-xr-x. 1 rok rok 741290633 feb  4 16:45 libtorch_cuda_cu.so
-rwxr-xr-x. 1 rok rok 214105800 feb  4 16:45 libtorch_cuda_linalg.so
-rwxr-xr-x. 1 rok rok    166536 feb  4 16:45 libtorch_cuda.so
-rwxr-xr-x. 1 rok rok     20817 feb  4 16:45 libtorch_global_deps.so
-rwxr-xr-x. 1 rok rok  23745625 feb  4 16:45 libtorch_python.so
-rwxr-xr-x. 1 rok rok     15384 feb  4 16:45 libtorch.so

(note that not all libcu* files are symlinked into the top-level _internal directory).

If I try to run it (on the build machine), it ends up crashing when starting the first training epoch, with

Plotting labels to runs/detect/train9/labels.jpg... 
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train9
Starting training for 2 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  0%|          | 0/8 [00:00<?, ?it/s]Could not load library libcudnn_cnn_train.so.8. Error: libcudnn_cnn_train.so.8: cannot open shared object file: No such file or directory

This one seems to be caused by libcudnn.so.8 being symlinked into the top-level _internal, while libcudnn_cnn_train.so.8 is not symlinked (which happens because it was not a link-time dependency of any collected binary).

Either removing the libcudnn.so.8 symlink from _internal, or adding a libcudnn_cnn_train.so.8 -> torch/lib/libcudnn_cnn_train.so.8 symlink to _internal, seems to fix this particular problem.
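(For an existing onedir build, the missing symlink can also be added after the fact with a few lines of Python; a sketch, with the application path assumed:)

# Sketch: add the missing top-level symlink, mirroring the other libcudnn entries.
import os

app_dir = 'dist/train_program/_internal'   # adjust to your application's dist directory
link = os.path.join(app_dir, 'libcudnn_cnn_train.so.8')
if not os.path.lexists(link):
    # Relative target, pointing at the copy collected into torch/lib.
    os.symlink('torch/lib/libcudnn_cnn_train.so.8', link)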


  1. Can you try building the above test program in your environment? Does it run as expected (on the build machine, and on the target machine)?

  2. Are the libcu* files in the _internal directory hard copies or symlinks? (Please post the output of ls -l commands run from Linux on the build machine, instead of looking at the files in Windows explorer.)

  3. Can you archive the whole build/train_program directory for the above test program (or at least the build/train_program/Analysis-00.toc file) and upload it somewhere so I can download it and take a look at it?

Based on the screenshots you've provided so far, it indeed looks like a second set of CUDA/cuDNN libraries is collected from somewhere (hence the libcu* files in your top-level _internal directory are not symlinks), and we first need to figure out where they came from.

@jiaqizhang123-stack (Author) commented Feb 5, 2024

Does it run as expected (on the build machine, and on the target machine)?

If I run it on the target machine, the error is:
[screenshot of the error]
Probably because I reinstalled CUDA 11.7 with the following cuDNN setup:

cp /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/include/cudnn*.h //media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/include/ 
cp -P /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/lib/libcudnn* /media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/lib/
chmod a+r /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/include/cudnn*.h /media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/lib/libcudnn*

When I remove the cuDNN, the error changes (see my follow-up comment below).
When I remove CUDA on the target machine, it works fine. But the value of the loss function for training is NaN, which is something I can't figure out:
[screenshot: training output]

The libcu* files in the top-level _internal directory are symlinks:

ls -l dist_gitce/train/_internal/ | grep libcu
lrwxrwxrwx  1 zjq zjq        33 2月   5 14:51 libcublas.so.11 -> nvidia/cublas/lib/libcublas.so.11
-rwxr-xr-x  1 zjq zjq 348150584 2月   5 14:51 libcublasLt.so.11
lrwxrwxrwx  1 zjq zjq        36 2月   5 14:51 libcudart-e409450e.so.11.0 -> torch/lib/libcudart-e409450e.so.11.0
lrwxrwxrwx  1 zjq zjq        43 2月   5 14:51 libcudart.782fcab0.so.11.0 -> torchvision.libs/libcudart.782fcab0.so.11.0
-rwxr-xr-x  1 zjq zjq    671072 2月   5 14:51 libcudart.so.11.0
lrwxrwxrwx  1 zjq zjq        30 2月   5 14:51 libcudnn.so.8 -> nvidia/cudnn/lib/libcudnn.so.8
-rwxr-xr-x  1 zjq zjq 125384784 2月   5 14:51 libcudnn_adv_infer.so.8
-rwxr-xr-x  1 zjq zjq 563283840 2月   5 14:51 libcudnn_cnn_infer.so.8
-rwxr-xr-x  1 zjq zjq  90849728 2月   5 14:51 libcudnn_ops_infer.so.8
-rwxr-xr-x  1 zjq zjq  71053560 2月   5 14:51 libcudnn_ops_train.so.8

to their copies in the torch/lib directory:

ls -l dist_gitce/train/_internal/torch/lib | grep libcu
-rwxr-xr-x 1 zjq zjq    700096 2月   5 14:50 libcudart-e409450e.so.11.0

Since I pip-installed torch-1.13.1-cp39-cp39-manylinux1_x86_64.whl, torch was installed along with:

nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
ls -l dist_gitce/train/_internal/nvidia/cudnn/lib | grep libcu
-rwxr-xr-x 1 zjq zjq    150200 2月   5 14:51 libcudnn.so.8
-rwxr-xr-x 1 zjq zjq 113856992 2月   5 14:51 libcudnn_adv_infer.so.8
-rwxr-xr-x 1 zjq zjq  95519224 2月   5 14:51 libcudnn_adv_train.so.8
-rwxr-xr-x 1 zjq zjq 458532080 2月   5 14:51 libcudnn_cnn_infer.so.8
-rwxr-xr-x 1 zjq zjq  74242352 2月   5 14:51 libcudnn_cnn_train.so.8
-rwxr-xr-x 1 zjq zjq  91406344 2月   5 14:51 libcudnn_ops_infer.so.8
-rwxr-xr-x 1 zjq zjq  67609960 2月   5 14:51 libcudnn_ops_train.so.8

Analysis-00.zip

@jiaqizhang123-stack (Author)

When I remove the cuDNN, the error is:
[screenshot of the error]

@rokm (Member) commented Feb 5, 2024

OK, so now you've switched to torch-1.13.1 from PyPI, where torch itself does not bundle the cuDNN shared libs in its torch/lib directory, but depends on the nvidia-cudnn package to provide them instead (and the same goes for the other CUDA libs).

If I build the test training program using this version, it seems to work fine out of the box, without having to remove the libcudnn.so.8 symlink from _internal or add a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink.


If I add an external cuDNN to LD_LIBRARY_PATH before the build (and that cuDNN is not of a compatible version), then I end up with hard copies of the libcudnn_*.so.8 files in _internal (collected from the external directory) instead of symlinks to nvidia/cudnn/lib/libcudnn_*.so.8 (which, according to your Analysis-00.toc, is also what happens in your case). And because the external cuDNN version does not match the version used by torch, this results in an error similar to yours.

I'll need to check where and how this dependency leak occurs, and what we can do about it. For now, the only way you can work around it is to ensure that you don't have an external CUDA/cuDNN in LD_LIBRARY_PATH (or in standard /usr/lib) when running PyInstaller.


If I take the initial build (the one I said worked out of the box) and run it in an environment that has an external cuDNN in LD_LIBRARY_PATH, it also crashes due to a mix of incompatible versions (specifically, it seems that the external libcudnn_cnn_train.so.8 is being loaded, but cannot be, because of missing/incompatible symbols).

Aside from removing the external cuDNN from LD_LIBRARY_PATH, it seems this can also be fixed by adding a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink to _internal. (To be absolutely sure, you can also add one for libcudnn_adv_train.so.8.)

For now you need to do this manually, but eventually, our hooks will be able to add these missing symlinks automatically.


So to summarize:

  • you should ensure that you don't have external CUDA/cuDNN in LD_LIBRARY_PATH when building the application
  • you should add a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink to the _internal directory (if you want to do this for a onefile build, you will have to add a.datas += [('libcudnn_cnn_train.so.8', 'nvidia/cudnn/lib/libcudnn_cnn_train.so.8', 'SYMLINK')] to the .spec file after Analysis is instantiated and before its fields are passed on to EXE).
  • you should remove libcuda.so.1 from _internal to make the build portable across different driver versions (for onefile, you will need to filter a.binaries in the .spec file); see the .spec sketch after this list for both of these.
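A hedged .spec sketch of those last two points (a minimal illustration, not a drop-in solution; "a" is the Analysis instance and the rest of the spec file is omitted):

# .spec fragment: place after the Analysis(...) call and before EXE(...).

# Add the missing top-level symlink, pointing at the copy collected into nvidia/cudnn/lib.
a.datas += [('libcudnn_cnn_train.so.8', 'nvidia/cudnn/lib/libcudnn_cnn_train.so.8', 'SYMLINK')]

# Drop the bundled driver library so the target machine's own libcuda.so.1 is used instead.
a.binaries = [entry for entry in a.binaries if not entry[0].startswith('libcuda.so')]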

When I remove CUDA on the target machine, it works fine. But the value of the loss function for training is NaN, which is something I can't figure out

I cannot really help with this, as I cannot reproduce the problem (without the code and data you are using), which might or might not be related to other issues we've seen here.

@jiaqizhang123-stack (Author)

OK, thank you so much!
