Invalid Device Context and Seg Fault with UCX+MPI+PyTorch #9498
Comments
Can you please also post the error itself?
Knew I was forgetting something :) Updated the description above!
Do you think this may be due to using RoCE?
@snarayan21 Can you post the output of
Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen for cudaMallocAsync/VMM memory:
cudaMallocAsync memory is supported in v1.15.x, but VMM memory isn't supported.
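One way to check which allocation mode a PyTorch process has been asked to use is to look at the PYTORCH_CUDA_ALLOC_CONF environment variable. The sketch below is an illustration, not something posted in this thread. The helper name describe_alloc_conf is hypothetical. It assumes that backend:cudaMallocAsync selects the async allocator and that expandable_segments:True makes PyTorch's caching allocator use CUDA virtual memory management. It only reads the environment setting, so it cannot tell how a particular tensor passed to bcast was actually allocated.

```python
import os


def describe_alloc_conf(conf: str) -> dict:
    """Report whether an allocator config string requests async or VMM-backed memory.

    Assumption: the string is a comma-separated list of key:value options.
    Bracketed list values are not parsed properly, which does not matter
    for the two keys inspected here.
    """
    options = {}
    for item in conf.split(","):
        key, sep, value = item.strip().partition(":")
        if sep:  # skip empty or malformed entries
            options[key.strip()] = value.strip()
    return {
        "malloc_async": options.get("backend", "native") == "cudaMallocAsync",
        "vmm": options.get("expandable_segments", "False").lower() == "true",
    }


if __name__ == "__main__":
    # Print what the current process environment requests.
    print(describe_alloc_conf(os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")))
```

If both values come back False, the default caching allocator settings are in effect as far as this variable is concerned. Memory allocated by other libraries in the same process would not show up here.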
Here's the output of
I'm not entirely sure. I'm just using the UCC backend with PyTorch, from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
Describe the bug
I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's PyTorch images, which has UCX installed as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to solve it. Solutions I have tried:
torch.cuda.set_device(rank)
I'm not sure what would be going wrong and would greatly appreciate assistance here!
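For reference, the torch.cuda.set_device(rank) attempt listed above can be sketched as follows. This is a minimal illustration, not the script from the report. The helper names local_rank_from_env and select_device are hypothetical. It assumes the job is launched with Open MPI's mpirun, which exports OMPI_COMM_WORLD_LOCAL_RANK for each process, and it falls back to LOCAL_RANK and then to 0 as an assumption. Using the node-local rank instead of the global rank matters only on multi-node runs. With -np 8 on a single node the two are the same.

```python
import os


def local_rank_from_env(env=os.environ) -> int:
    """Return this process's rank within its node.

    Assumption: Open MPI's mpirun sets OMPI_COMM_WORLD_LOCAL_RANK;
    LOCAL_RANK is checked as a fallback for other launchers.
    """
    for name in ("OMPI_COMM_WORLD_LOCAL_RANK", "LOCAL_RANK"):
        if name in env:
            return int(env[name])
    return 0  # assumption: single-process run, use device 0


def select_device(num_gpus: int, env=os.environ) -> int:
    """Map the node-local rank onto a visible GPU index."""
    if num_gpus <= 0:
        raise ValueError("no visible GPUs")
    return local_rank_from_env(env) % num_gpus


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        device = select_device(torch.cuda.device_count())
        # Bind the process to its GPU before any collective is issued.
        torch.cuda.set_device(device)
        print(f"using cuda:{device}")
```

Binding the device before the process group is created keeps every later allocation and collective on the same CUDA context for that rank.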
Error message and stack trace:
Steps to Reproduce
mpirun --allow-run-as-root -np 8 python myscript.py
--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --without-java --enable-devel-headers --with-cuda=/usr/local/cuda --with-gdrcopy=/workspace --prefix=/opt/hpcx/ucx
UCX_TLS = cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp
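The UCX_TLS value above names shared-memory, CUDA, self, and TCP transports. A quick way to check whether a given list also names an InfiniBand/RoCE transport is sketched below. The helper names and the RDMA_NAMES set are assumptions for illustration: the set is a non-exhaustive selection of UCX transport names and aliases, and the sketch handles only a plain comma-separated list (not the ^ exclusion form).

```python
# Assumption: a non-exhaustive set of UCX names/aliases for RDMA transports.
RDMA_NAMES = {"ib", "rc", "rc_v", "rc_x", "ud", "ud_v", "ud_x", "dc", "dc_x"}


def split_transports(ucx_tls: str) -> list:
    """Split a plain comma-separated UCX_TLS value into transport names."""
    return [t.strip() for t in ucx_tls.split(",") if t.strip()]


def names_rdma_transport(ucx_tls: str) -> bool:
    """Return True if the list names an RDMA transport, or 'all'."""
    return any(t == "all" or t in RDMA_NAMES for t in split_transports(ucx_tls))


if __name__ == "__main__":
    reported = "cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp"
    print(split_transports(reported))
    print("names an RDMA transport:", names_rdma_transport(reported))
```

For the value reported here the check returns False, which suggests (under the assumptions above) that this configuration does not explicitly select a RoCE transport.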
Setup and versions
lsmod|grep gdrdrv gives me:
gdrdrv 24576 0
nvidia 56512512 523 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX: