
UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). #8269

Open

jiaqizhang123-stack opened this issue Jan 31, 2024 · 42 comments
Labels: triage (Please triage and relabel this issue)

@jiaqizhang123-stack

Hello, when I use PyInstaller to package my training code (torch, CUDA, Linux), CUDA cannot be used on another computer. Is this because CUDA wasn't packaged?
The error is:
[screenshot of the error]
Thank you for your help.

jiaqizhang123-stack added the triage label on Jan 31, 2024
@rokm (Member) commented Jan 31, 2024

Probably. Try updating pyinstaller-hooks-contrib to the latest version (which has pyinstaller/pyinstaller-hooks-contrib#676) and see if that gets you any further.
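(To double-check which versions are actually present in the build environment, here is a minimal sketch using only the standard library; the names are the PyPI distribution names:)

# Sketch: print the versions of the packages relevant to this issue.
from importlib.metadata import version, PackageNotFoundError

for dist in ("pyinstaller", "pyinstaller-hooks-contrib", "torch"):
    try:
        print(dist, version(dist))
    except PackageNotFoundError:
        print(dist, "not installed")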

@jiaqizhang123-stack (Author)

[screenshot: installed package versions]
The PyInstaller I installed is already the latest version.

@rokm (Member) commented Jan 31, 2024

How did you install torch? pip or conda?

@jiaqizhang123-stack (Author)

I downloaded the .whl file and installed it using pip.

@rokm (Member) commented Jan 31, 2024

Hmm, if you freeze the following example

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

does it work? Or does torch.cuda.is_available() also return False?

Does the target computer have the NVIDIA driver installed?

@rokm (Member) commented Jan 31, 2024

That 'Unexpected error from cudaGetDeviceCount' together with 'forward compatibility was attempted on non-supported HW' likely implies a driver issue. Does CUDA otherwise work on the target machine? Does nvidia-smi show any errors? See pytorch/pytorch#40671

@jiaqizhang123-stack (Author)

[screenshot: nvidia-smi output]
nvidia-smi does not show any errors.

@jiaqizhang123-stack (Author)

Hmm, if you freeze the following example

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

does it work? Or does torch.cuda.is_available() also return False?

Does the target computer have the NVIDIA driver installed?

When I package using different versions of CUDA and torch, the CUDA device counts obtained on the target computer are inconsistent. The graphics card of the target computer is a GTX 1650.
CUDA 11.7:
[screenshot: device count output with CUDA 11.7]
CUDA 10.2:
[screenshot: device count output with CUDA 10.2]

@rokm (Member) commented Feb 1, 2024

FWIW, if I freeze

import torch
print("torch.cuda.is_available:", torch.cuda.is_available())
print("torch.cuda.device_count:", torch.cuda.device_count())

in a clean Python 3.9.18 virtual environment on my Fedora 39 desktop (with torch installed via pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117), and transfer the program to my Fedora 39 notebook, it works as expected - torch.cuda.is_available() returns True and torch.cuda.device_count() returns 1. Both systems are running driver 545.29.06, though (and neither system has a system-installed CUDA toolkit - just the driver and its CUDA run-time components).


If you set up a Python virtual environment on the target system and install torch in there, does running torch.cuda.is_available() in a Python interactive prompt work as expected?

@jiaqizhang123-stack (Author)

When the driver versions of the two computers are the same, torch.cuda.is_available() returns True. Why does the driver version affect the use of CUDA? Is this related to an incomplete set of libraries being included during packaging?
CUDA 11.7, torch 1.13.1.
The loss function results in a NaN value on the GTX 1650 with the packaged exe; is it because the GTX 1650 is not compatible with CUDA 11.7?

@rokm (Member) commented Feb 2, 2024

When the driver versions of the two computers are the same, torch.cuda.is_available() returns True. Why does the driver version affect the use of CUDA? Is this related to an incomplete set of libraries being included during packaging? CUDA 11.7, torch 1.13.1. The loss function results in a NaN value on the GTX 1650 with the packaged exe; is it because the GTX 1650 is not compatible with CUDA 11.7?

Ah - in that case, the problem is likely that we collect part of the driver libraries that we shouldn't be collecting.

@rokm (Member) commented Feb 2, 2024

If you are using onedir mode, can you try removing libcuda.so.1 from the frozen application directory?

Alternatively, you can add r'libcuda\.so(\..*)?' to this list in your copy of PyInstaller (and rebuild the application with the --clean option):

_unix_excludes = {
    r'libc\.so(\..*)?',
    r'libdl\.so(\..*)?',
    r'libm\.so(\..*)?',
    r'libpthread\.so(\..*)?',
    r'librt\.so(\..*)?',
    r'libthread_db\.so(\..*)?',
    # glibc regex excludes.
    r'ld-linux\.so(\..*)?',
    r'libBrokenLocale\.so(\..*)?',
    r'libanl\.so(\..*)?',
    r'libcidn\.so(\..*)?',
    r'libcrypt\.so(\..*)?',
    r'libnsl\.so(\..*)?',
    r'libnss_compat.*\.so(\..*)?',
    r'libnss_dns.*\.so(\..*)?',
    r'libnss_files.*\.so(\..*)?',
    r'libnss_hesiod.*\.so(\..*)?',
    r'libnss_nis.*\.so(\..*)?',
    r'libnss_nisplus.*\.so(\..*)?',
    r'libresolv\.so(\..*)?',
    r'libutil\.so(\..*)?',
    # graphical interface libraries come with graphical stack (see libglvnd)
    r'libE?(Open)?GLX?(ESv1_CM|ESv2)?(dispatch)?\.so(\..*)?',
    r'libdrm\.so(\..*)?',
    # a subset of libraries included as part of the Nvidia Linux Graphics Driver as of 520.56.06:
    # https://download.nvidia.com/XFree86/Linux-x86_64/520.56.06/README/installedcomponents.html
    r'nvidia_drv\.so',
    r'libglxserver_nvidia\.so(\..*)?',
    r'libnvidia-egl-(gbm|wayland)\.so(\..*)?',
    r'libnvidia-(cfg|compiler|e?glcore|glsi|glvkspirv|rtcore|allocator|tls|ml)\.so(\..*)?',
    r'lib(EGL|GLX)_nvidia\.so(\..*)?',
    # libxcb-dri changes ABI frequently (e.g.: between Ubuntu LTS releases) and is usually installed as dependency of
    # the graphics stack anyway. No need to bundle it.
    r'libxcb\.so(\..*)?',
    r'libxcb-dri.*\.so(\..*)?',
}

Does that fix the problem?
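For orientation, the edited list would look roughly like this (a sketch, not a full patch; in recent PyInstaller versions the list typically lives in PyInstaller/depend/dylib.py, but the exact location may differ between releases):

_unix_excludes = {
    r'libcuda\.so(\..*)?',  # added entry: the driver's libcuda must come from the target system
    r'libc\.so(\..*)?',
    r'libdl\.so(\..*)?',
    # ... remaining original entries unchanged ...
}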

@jiaqizhang123-stack (Author)

[screenshot of the error]
When I test on an RTX 4090, I have this problem again. Checking the files inside the onedir build, there is no libcudnn_ops_infer.so.8 library in it. On the GTX 1650 it is able to run, but the RTX 4090 reports an error.

When I use onedir mode, removing libcuda.so.1 is OK; onefile mode is not tested yet.

@rokm (Member) commented Feb 3, 2024

If this is still torch 1.13.1+cu117, then originally, the libcudnn_ops_infer.so.8 is in site-packages/torch/lib/libcudnn_ops_infer.so.8 - and thus should be collected to torch/lib/libcudnn_ops_infer.so.8.

Can you check the contents of the torch/lib directory in the onedir frozen application, and paste the list of files here?

@jiaqizhang123-stack (Author)

[screenshot: contents of torch/lib in the frozen application]

@rokm (Member) commented Feb 3, 2024

Hmmm... in that case, can you rebuild again (with the --clean option), and watch the build log for a line like

...
INFO: Loading module hook 'hook-torch.py' from '/path/to/hook'...
...

to see where the torch hook is loaded from; then open that file, and paste its contents here.

@jiaqizhang123-stack (Author)

[screenshot: build log line showing where hook-torch.py is loaded from]

@rokm (Member) commented Feb 3, 2024

What are contents of the hook-torch.py file in that directory?

@jiaqizhang123-stack (Author)

[screenshot: contents of hook-torch.py]

@rokm (Member) commented Feb 3, 2024

That's not from pyinstaller-hooks-contrib 2024.0 - which should look like this.

Your hook file seems to be from 2023.10 or earlier.


Although, since it is an old brute-force hook that collects the whole torch directory, it should collect the cuDNN libs from torch/lib as well (2023.11 would likely fail to collect versioned .so.8 files, while 2023.12 and 2024.0 should be OK again).

So the next question is: what are the contents of the site-packages/torch/lib directory? Are the libcudnn_*.so.8 files there in the first place?

(If you were not rebuilding with --clean before, did the clean build perhaps collect the cuDNN files into the frozen application?)
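(A quick way to check this without hunting for the site-packages path; a small sketch to run inside the build environment:)

# Sketch: list the shared libraries that ship inside the installed torch package.
import os
import torch

torch_lib = os.path.join(os.path.dirname(torch.__file__), 'lib')
for name in sorted(os.listdir(torch_lib)):
    print(name)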

@jiaqizhang123-stack (Author)

[screenshot: contents of site-packages/torch/lib]
The libcudnn_*.so.8 files are not there in the first place.

@rokm (Member) commented Feb 3, 2024

Is this conda-installed torch, then? At the beginning, you said you downloaded .whl files - from where?

@jiaqizhang123-stack (Author)

I installed it from a .whl file using pip, not with conda.

@rokm (Member) commented Feb 3, 2024

Where did you get the .whl file from?

@jiaqizhang123-stack (Author)

Sorry, I think I installed this torch with conda when I created the environment. I have now updated pyinstaller-hooks-contrib; the libcudnn_*.so.8 files are still not there in the first place.
[screenshot]
[screenshot]

@rokm (Member) commented Feb 3, 2024

The hook does not really support conda-installed torch. Because in that case, the CUDA and cuDNN libs are not part of the torch package, and would need to be collected from the conda environment. But we can automatically collect only the ones that are link-time dependencies (which are picked up by our binary dependency analysis), while the ones that are dynamically loaded at run-time (like the cuDNN libs) are not automatically collected. You'd need to manually ensure they are collected (for example, via --add-binary) or manually copy them into the (onedir) frozen application's top-level directory.
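As an illustration of the manual-collection route (a sketch only; the conda environment path and the glob pattern are assumptions, adjust them to your setup), the extra binaries can be passed to Analysis in the .spec file, which has the same effect as repeated --add-binary options:

# .spec fragment; Analysis is provided by PyInstaller when it processes the spec file.
import glob

conda_lib = '/opt/conda/envs/myenv/lib'   # assumed location of the conda environment's lib directory
extra_cudnn = [(f, '.') for f in glob.glob(conda_lib + '/libcudnn*.so*')]

a = Analysis(
    ['train.py'],
    binaries=extra_cudnn,   # same effect as --add-binary "<src>:." on the command line
    # ... other Analysis arguments as generated by pyi-makespec ...
)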

@rokm (Member) commented Feb 3, 2024

Hmm, actually, it looks like the cuDNN libs are part of the torch conda package (installed as conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia as per https://pytorch.org/get-started/previous-versions/):

$ ls -l /home/rok/miniconda3/envs/pyi-torch/lib/python3.9/site-packages/torch/lib
total 2268760
-rwxr-xr-x. 2 rok rok    415264 Dec  8  2022 libc10_cuda.so
-rwxr-xr-x. 2 rok rok    882160 Dec  8  2022 libc10.so
-rwxr-xr-x. 2 rok rok     17544 Dec  8  2022 libcaffe2_nvrtc.so
-rwxr-xr-x. 2 rok rok 113856992 Dec  8  2022 libcudnn_adv_infer.so.8
-rwxr-xr-x. 2 rok rok  95519224 Dec  8  2022 libcudnn_adv_train.so.8
-rwxr-xr-x. 2 rok rok 458532072 Dec  8  2022 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 2 rok rok  74242352 Dec  8  2022 libcudnn_cnn_train.so.8
-rwxr-xr-x. 2 rok rok  91406344 Dec  8  2022 libcudnn_ops_infer.so.8
-rwxr-xr-x. 2 rok rok  67609952 Dec  8  2022 libcudnn_ops_train.so.8
-rwxr-xr-x. 2 rok rok    163040 Dec  8  2022 libcudnn.so.8
-rwxr-xr-x. 2 rok rok   7104344 Dec  8  2022 libcupti-6ac96871.so.11.7
-rwxr-xr-x. 2 rok rok     44656 Dec  8  2022 libshm.so
-rwxr-xr-x. 2 rok rok 306617864 Dec  8  2022 libtorch_cpu.so
-rwxr-xr-x. 2 rok rok 149299960 Dec  8  2022 libtorch_cuda_cpp.so
-rwxr-xr-x. 2 rok rok 852888760 Dec  8  2022 libtorch_cuda_cu.so
-rwxr-xr-x. 2 rok rok  82386400 Dec  8  2022 libtorch_cuda_linalg.so
-rwxr-xr-x. 2 rok rok    166632 Dec  8  2022 libtorch_cuda.so
-rwxr-xr-x. 2 rok rok     15696 Dec  8  2022 libtorch_global_deps.so
-rwxr-xr-x. 2 rok rok  21984128 Dec  8  2022 libtorch_python.so
-rwxr-xr-x. 2 rok rok     15480 Dec  8  2022 libtorch.so

and if I freeze a test torch program in that miniconda environment, they all end up collected:

ls -l dist/program/_internal/torch/lib
total 2268760
-rwxr-xr-x. 1 rok rok    415264 Feb  3 13:08 libc10_cuda.so
-rwxr-xr-x. 1 rok rok    882160 Feb  3 13:08 libc10.so
-rwxr-xr-x. 1 rok rok     17544 Feb  3 13:08 libcaffe2_nvrtc.so
-rwxr-xr-x. 1 rok rok 113856992 Feb  3 13:08 libcudnn_adv_infer.so.8
-rwxr-xr-x. 1 rok rok  95519224 Feb  3 13:08 libcudnn_adv_train.so.8
-rwxr-xr-x. 1 rok rok 458532072 Feb  3 13:08 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 1 rok rok  74242352 Feb  3 13:08 libcudnn_cnn_train.so.8
-rwxr-xr-x. 1 rok rok  91406344 Feb  3 13:08 libcudnn_ops_infer.so.8
-rwxr-xr-x. 1 rok rok  67609952 Feb  3 13:08 libcudnn_ops_train.so.8
-rwxr-xr-x. 1 rok rok    163040 Feb  3 13:08 libcudnn.so.8
-rwxr-xr-x. 1 rok rok   7104344 Feb  3 13:08 libcupti-6ac96871.so.11.7
-rwxr-xr-x. 1 rok rok     44656 Feb  3 13:08 libshm.so
-rwxr-xr-x. 1 rok rok 306617864 Feb  3 13:08 libtorch_cpu.so
-rwxr-xr-x. 1 rok rok 149299960 Feb  3 13:08 libtorch_cuda_cpp.so
-rwxr-xr-x. 1 rok rok 852888760 Feb  3 13:08 libtorch_cuda_cu.so
-rwxr-xr-x. 1 rok rok  82386400 Feb  3 13:08 libtorch_cuda_linalg.so
-rwxr-xr-x. 1 rok rok    166632 Feb  3 13:08 libtorch_cuda.so
-rwxr-xr-x. 1 rok rok     15696 Feb  3 13:08 libtorch_global_deps.so
-rwxr-xr-x. 1 rok rok  21984128 Feb  3 13:08 libtorch_python.so
-rwxr-xr-x. 1 rok rok     15480 Feb  3 13:08 libtorch.so

@jiaqizhang123-stack (Author)

Once the above problems were solved, a new problem arose:
[screenshot of the new error]

@rokm (Member) commented Feb 4, 2024

This looks like an external cuDNN (in /home/zhang/cuda/lib64) being mixed with the cuDNN that was collected into the frozen application. Is libcudnn_cnn_train.so.8 collected in _internal/torch/lib? And is there a symbolic link to that file in the top-level _internal directory?

What happens if you temporarily remove /home/zhang/cuda/lib64 from LD_LIBRARY_PATH in the target environment?
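(To answer the symlink question quickly, something like this can be run against the onedir output; the dist path is an assumption:)

# Sketch: show whether the top-level libcudnn entries are symlinks and where they point.
import os

app_dir = 'dist/train/_internal'   # adjust to your application's dist directory
for name in sorted(os.listdir(app_dir)):
    if name.startswith('libcudnn'):
        path = os.path.join(app_dir, name)
        target = os.readlink(path) if os.path.islink(path) else '(regular file, not a symlink)'
        print(name, '->', target)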

@jiaqizhang123-stack (Author)

If I temporarily remove /home/zhang/cuda/lib64 from LD_LIBRARY_PATH in the target environment, the program can run.

@jiaqizhang123-stack (Author)

[screenshot]
[screenshot]
There are also conflicts between _internal and torch/lib, and _internal's files seem to come from cuDNN.

@rokm (Member) commented Feb 4, 2024

Hmmm, yeah, those libcudnn_* files in the top-level _internal directory should be symbolic links to their counterparts in _internal/torch/lib:

ls -l dist/program/_internal | grep libcudnn
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_adv_infer.so.8 -> torch/lib/libcudnn_adv_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_cnn_infer.so.8 -> torch/lib/libcudnn_cnn_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_ops_infer.so.8 -> torch/lib/libcudnn_ops_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  3 19:53 libcudnn_ops_train.so.8 -> torch/lib/libcudnn_ops_train.so.8
lrwxrwxrwx.  1 rok rok      23 feb  3 19:53 libcudnn.so.8 -> torch/lib/libcudnn.so.8

Can you open build/<name>/Analysis-00.toc and search for all instances of, for example, libcudnn_adv_infer.so.8, and check where they were collected from?
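(If it is easier to script, something like the following does the same search; the build path is an assumption, adjust it to your spec name:)

# Sketch: print every TOC entry that mentions the given library name.
toc_path = 'build/train/Analysis-00.toc'   # adjust to build/<name>/Analysis-00.toc
with open(toc_path) as f:
    for line in f:
        if 'libcudnn_adv_infer.so.8' in line:
            print(line.rstrip())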

Do you also have an external CUDA toolkit in LD_LIBRARY_PATH on the build system (similarly to how you had it on the target system in /home/zhang/cuda/lib64)? If so, can you try temporarily removing it from LD_LIBRARY_PATH and rebuilding the program?

@jiaqizhang123-stack (Author)

I deleted this file, but I compared it with the file under cudnn and it's the same size.
I do not have an external CUDA toolkit in LD_LIBRARY_PATH on the build system. How can I avoid prioritizing calls to external environments when running on the target machine?
[screenshot]

@rokm (Member) commented Feb 4, 2024

how can I avoid prioritizing calls to external environments when running on the target machine?

You cannot, that's the problem.

I do not have an external CUDA toolkit in LD_LIBRARY_PATH on the build system

Do you have CUDA and cuDNN installed in the conda environment, then (for example, they would be installed via conda if this is the same environment that you previously had conda-installed torch in)?

@rokm (Member) commented Feb 4, 2024

how can I avoid prioritizing calls to external environments when running on the target machine?

You cannot, that's the problem.

I.e., the binary dependency analysis on Linux tries to resolve the shared lib dependencies via ldd, and that probably ends up pulling in the external environment (and only if that fails, we try the parent directory as a work-around). I plan to rework this a bit (in particular, if a binary was already discovered via hooks, that should be accounted for in the binary dependency analysis). And for this particular case, I would like to figure out what exactly is going on, so I can have a local test case to work with.

@rokm (Member) commented Feb 4, 2024

Uh, wait, just to reconfirm - are you now using conda-installed or pip-installed torch?

@jiaqizhang123-stack (Author) commented Feb 4, 2024 via email

@rokm (Member) commented Feb 4, 2024

OK, I've tried to build and run the following example program

# train_program.py
from ultralytics import YOLO

model = YOLO('yolov8n.yaml').load('yolov8n.pt')  # build from YAML and transfer weights
results = model.train(data='coco128.yaml', epochs=2, imgsz=640)

with system-installed Python 3.9.18 (i.e., no conda), a clean virtual environment, and pip-installed torch==1.13.1+cu117 and ultralytics==8.0.81. (For easier debugging, this is a onedir build.)

The libcu* files in the top-level _internal directory are symlinks:

$ ls -l dist/train_program/_internal/ | grep libcu
lrwxrwxrwx.  1 rok rok      27 feb  4 16:45 libcublasLt.so.11 -> torch/lib/libcublasLt.so.11
lrwxrwxrwx.  1 rok rok      25 feb  4 16:45 libcublas.so.11 -> torch/lib/libcublas.so.11
lrwxrwxrwx.  1 rok rok      43 feb  4 16:45 libcudart.782fcab0.so.11.0 -> torchvision.libs/libcudart.782fcab0.so.11.0
lrwxrwxrwx.  1 rok rok      36 feb  4 16:45 libcudart-e409450e.so.11.0 -> torch/lib/libcudart-e409450e.so.11.0
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_adv_infer.so.8 -> torch/lib/libcudnn_adv_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_cnn_infer.so.8 -> torch/lib/libcudnn_cnn_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_ops_infer.so.8 -> torch/lib/libcudnn_ops_infer.so.8
lrwxrwxrwx.  1 rok rok      33 feb  4 16:45 libcudnn_ops_train.so.8 -> torch/lib/libcudnn_ops_train.so.8
lrwxrwxrwx.  1 rok rok      23 feb  4 16:45 libcudnn.so.8 -> torch/lib/libcudnn.so.8

to their copies in the torch/lib directory:

$ ls -l dist/train_program/_internal/torch/lib
total 3169480
-rwxr-xr-x. 1 rok rok   1230457 feb  4 16:45 libc10_cuda.so
-rwxr-xr-x. 1 rok rok    878057 feb  4 16:45 libc10.so
-rwxr-xr-x. 1 rok rok     25521 feb  4 16:45 libcaffe2_nvrtc.so
-rwxr-xr-x. 1 rok rok 348150584 feb  4 16:45 libcublasLt.so.11
-rwxr-xr-x. 1 rok rok 156720544 feb  4 16:45 libcublas.so.11
-rwxr-xr-x. 1 rok rok    687321 feb  4 16:45 libcudart-e409450e.so.11.0
-rwxr-xr-x. 1 rok rok 113856969 feb  4 16:45 libcudnn_adv_infer.so.8
-rwxr-xr-x. 1 rok rok  95519201 feb  4 16:45 libcudnn_adv_train.so.8
-rwxr-xr-x. 1 rok rok 458532049 feb  4 16:45 libcudnn_cnn_infer.so.8
-rwxr-xr-x. 1 rok rok  74242329 feb  4 16:45 libcudnn_cnn_train.so.8
-rwxr-xr-x. 1 rok rok  91406321 feb  4 16:45 libcudnn_ops_infer.so.8
-rwxr-xr-x. 1 rok rok  67609937 feb  4 16:45 libcudnn_ops_train.so.8
-rwxr-xr-x. 1 rok rok    150200 feb  4 16:45 libcudnn.so.8
-rwxr-xr-x. 1 rok rok    168721 feb  4 16:45 libgomp-a34b3233.so.1
-rwxr-xr-x. 1 rok rok   7079489 feb  4 16:45 libnvrtc-builtins.so.11.7
-rwxr-xr-x. 1 rok rok  45791369 feb  4 16:45 libnvrtc-d833c4f3.so.11.2
-rwxr-xr-x. 1 rok rok     43681 feb  4 16:45 libnvToolsExt-847d78f2.so.1
-rwxr-xr-x. 1 rok rok     44560 feb  4 16:45 libshm.so
-rwxr-xr-x. 1 rok rok 539370033 feb  4 16:45 libtorch_cpu.so
-rwxr-xr-x. 1 rok rok 264587265 feb  4 16:45 libtorch_cuda_cpp.so
-rwxr-xr-x. 1 rok rok 741290633 feb  4 16:45 libtorch_cuda_cu.so
-rwxr-xr-x. 1 rok rok 214105800 feb  4 16:45 libtorch_cuda_linalg.so
-rwxr-xr-x. 1 rok rok    166536 feb  4 16:45 libtorch_cuda.so
-rwxr-xr-x. 1 rok rok     20817 feb  4 16:45 libtorch_global_deps.so
-rwxr-xr-x. 1 rok rok  23745625 feb  4 16:45 libtorch_python.so
-rwxr-xr-x. 1 rok rok     15384 feb  4 16:45 libtorch.so

(note that not all libcu* files are symlinked into the top-level _internal directory).

If I try to run it (on the build machine), it ends up crashing when starting the first training epoch, with

Plotting labels to runs/detect/train9/labels.jpg... 
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/detect/train9
Starting training for 2 epochs...

      Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size
  0%|          | 0/8 [00:00<?, ?it/s]Could not load library libcudnn_cnn_train.so.8. Error: libcudnn_cnn_train.so.8: cannot open shared object file: No such file or directory

This one seems to be caused by libcudnn.so.8 being symlinked into the top-level _internal, while libcudnn_cnn_train.so.8 is not symlinked (which happens because it was not a link-time dependency of any collected binary).

Either removing the libcudnn.so.8 symlink from _internal, or adding a libcudnn_cnn_train.so.8 -> torch/lib/libcudnn_cnn_train.so.8 symlink to _internal, seems to fix this particular problem.
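(For an existing onedir build, the missing symlink can also be added after the fact with a few lines of Python; a sketch, with the application path assumed:)

# Sketch: add the missing top-level symlink, mirroring the other libcudnn entries.
import os

app_dir = 'dist/train_program/_internal'   # adjust to your application's dist directory
link = os.path.join(app_dir, 'libcudnn_cnn_train.so.8')
if not os.path.lexists(link):
    # Relative target, pointing at the copy collected into torch/lib.
    os.symlink('torch/lib/libcudnn_cnn_train.so.8', link)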


  1. Can you try building the above test program in your environment? Does it run as expected (on the build machine, and on the target machine)?

  2. Are the libcu* files in the _internal directory hard copies or symlinks? (Please post the output of ls -l commands run from Linux on the build machine, instead of looking at the files in Windows explorer.)

  3. Can you archive the whole build/train_program directory for the above test program (or at least the build/train_program/Analysis-00.toc file) and upload it somewhere so I can download it and take a look at it?

Based on the screenshots you've provided so far, it indeed looks like a second set of CUDA/cuDNN libraries is collected from somewhere (hence the libcu* files in your top-level _internal directory are not symlinks), and we first need to figure out where they came from.

@jiaqizhang123-stack (Author) commented Feb 5, 2024

Does it run as expected (on the build machine, and on the target machine)?

If I run it on the target machine, the error is:
[screenshot of the error]
Probably because I reinstalled CUDA 11.7 with the following cuDNN setup:

cp /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/include/cudnn*.h //media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/include/ 
cp -P /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/lib/libcudnn* /media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/lib/
chmod a+r /media/newdata1/zjq/cudnn-linux-x86_64-8.9.7.29_cuda11-archive/include/cudnn*.h /media/newdata1/zjq/cuda-11.7/targets/x86_64-linux/lib/libcudnn*

When I remove the cuDNN, the error changes (see my follow-up comment below).
When I remove CUDA on the target machine, it works fine. But the value of the loss function for training is NaN, which is something I can't figure out:
[screenshot: training output]

The libcu* files in the top-level _internal directory are symlinks:

ls -l dist_gitce/train/_internal/ | grep libcu
lrwxrwxrwx  1 zjq zjq        33 2月   5 14:51 libcublas.so.11 -> nvidia/cublas/lib/libcublas.so.11
-rwxr-xr-x  1 zjq zjq 348150584 2月   5 14:51 libcublasLt.so.11
lrwxrwxrwx  1 zjq zjq        36 2月   5 14:51 libcudart-e409450e.so.11.0 -> torch/lib/libcudart-e409450e.so.11.0
lrwxrwxrwx  1 zjq zjq        43 2月   5 14:51 libcudart.782fcab0.so.11.0 -> torchvision.libs/libcudart.782fcab0.so.11.0
-rwxr-xr-x  1 zjq zjq    671072 2月   5 14:51 libcudart.so.11.0
lrwxrwxrwx  1 zjq zjq        30 2月   5 14:51 libcudnn.so.8 -> nvidia/cudnn/lib/libcudnn.so.8
-rwxr-xr-x  1 zjq zjq 125384784 2月   5 14:51 libcudnn_adv_infer.so.8
-rwxr-xr-x  1 zjq zjq 563283840 2月   5 14:51 libcudnn_cnn_infer.so.8
-rwxr-xr-x  1 zjq zjq  90849728 2月   5 14:51 libcudnn_ops_infer.so.8
-rwxr-xr-x  1 zjq zjq  71053560 2月   5 14:51 libcudnn_ops_train.so.8

to their copies in the torch/lib directory:

ls -l dist_gitce/train/_internal/torch/lib | grep libcu
-rwxr-xr-x 1 zjq zjq    700096 2月   5 14:50 libcudart-e409450e.so.11.0

Since I pip-installed torch-1.13.1-cp39-cp39-manylinux1_x86_64.whl, torch was installed along with:

nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
ls -l dist_gitce/train/_internal/nvidia/cudnn/lib | grep libcu
-rwxr-xr-x 1 zjq zjq    150200 2月   5 14:51 libcudnn.so.8
-rwxr-xr-x 1 zjq zjq 113856992 2月   5 14:51 libcudnn_adv_infer.so.8
-rwxr-xr-x 1 zjq zjq  95519224 2月   5 14:51 libcudnn_adv_train.so.8
-rwxr-xr-x 1 zjq zjq 458532080 2月   5 14:51 libcudnn_cnn_infer.so.8
-rwxr-xr-x 1 zjq zjq  74242352 2月   5 14:51 libcudnn_cnn_train.so.8
-rwxr-xr-x 1 zjq zjq  91406344 2月   5 14:51 libcudnn_ops_infer.so.8
-rwxr-xr-x 1 zjq zjq  67609960 2月   5 14:51 libcudnn_ops_train.so.8

Analysis-00.zip

@jiaqizhang123-stack (Author)

When I remove the cuDNN, the error is:
[screenshot of the error]

@rokm (Member) commented Feb 5, 2024

OK, so now you've switched to torch-1.13.1 from PyPI, where torch itself does not bundle the cuDNN shared libs in its torch/lib directory, but depends on the nvidia-cudnn package to provide them instead (and the same goes for the other CUDA libs).

If I build the test training program using this version, it seems to work fine out of the box, without having to remove the libcudnn.so.8 symlink from _internal or add a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink.


If I add an external cuDNN to LD_LIBRARY_PATH before the build (and that cuDNN is not of a compatible version), then I end up with hard copies of the libcudnn_*.so.8 files in _internal (collected from the external directory) instead of symlinks to nvidia/cudnn/lib/libcudnn_*.so.8 (which, according to your Analysis-00.toc, is also what happens in your case). And because the external cuDNN version does not match the version used by torch, this results in an error similar to yours.

I'll need to check where and how this dependency leak occurs, and what we can do about it. For now, the only way you can work around it is to ensure that you don't have an external CUDA/cuDNN in LD_LIBRARY_PATH (or in standard /usr/lib) when running PyInstaller.


If I take the initial build (the one I said worked out of the box) and run it in an environment that has an external cuDNN in LD_LIBRARY_PATH, it also crashes due to a mix of incompatible versions (specifically, it seems that the external libcudnn_cnn_train.so.8 is being loaded, but cannot be, because of missing/incompatible symbols).

Aside from removing the external cuDNN from LD_LIBRARY_PATH, it seems this can also be fixed by adding a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink to _internal. (To be absolutely sure, you can also add one for libcudnn_adv_train.so.8.)

For now you need to do this manually, but eventually, our hooks will be able to add these missing symlinks automatically.


So to summarize:

  • you should ensure that you don't have external CUDA/cuDNN in LD_LIBRARY_PATH when building the application
  • you should add a libcudnn_cnn_train.so.8 -> nvidia/cudnn/lib/libcudnn_cnn_train.so.8 symlink to the _internal directory (if you want to do this for a onefile build, you will have to add a.datas += [('libcudnn_cnn_train.so.8', 'nvidia/cudnn/lib/libcudnn_cnn_train.so.8', 'SYMLINK')] to the .spec file after Analysis is instantiated and before its fields are passed on to EXE).
  • you should remove libcuda.so.1 from _internal to make the build portable across different driver versions (for onefile, you will need to filter a.binaries in the .spec file); see the .spec sketch after this list for both of these.
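A hedged .spec sketch of those last two points (a minimal illustration, not a drop-in solution; "a" is the Analysis instance and the rest of the spec file is omitted):

# .spec fragment: place after the Analysis(...) call and before EXE(...).

# Add the missing top-level symlink, pointing at the copy collected into nvidia/cudnn/lib.
a.datas += [('libcudnn_cnn_train.so.8', 'nvidia/cudnn/lib/libcudnn_cnn_train.so.8', 'SYMLINK')]

# Drop the bundled driver library so the target machine's own libcuda.so.1 is used instead.
a.binaries = [entry for entry in a.binaries if not entry[0].startswith('libcuda.so')]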

When I remove CUDA on the target machine, it works fine. But the value of the loss function for training is NaN, which is something I can't figure out

I cannot really help with this, as I cannot reproduce the problem (without the code and data you are using), which might or might not be related to other issues we've seen here.

@jiaqizhang123-stack (Author)

OK, thank you so much!
