Invalid Device Context and Seg Fault with UCX+MPI+PyTorch #9498

Open
snarayan21 opened this issue Nov 18, 2023 · 5 comments
@snarayan21

snarayan21 commented Nov 18, 2023

Describe the bug

I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's PyTorch images, which has UCX installed as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to solve it. Solutions I have tried:

  • changing the UCX_TLS environment variable
  • explicitly calling torch.cuda.set_device(rank)
  • reinstalling UCX with gdrcopy (though gdrcopy does not seem to be recognized when running ucx_info -d)

I'm not sure what would be going wrong and would greatly appreciate assistance here!
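
For context, the failing collective goes through torch.distributed with the MPI backend (ProcessGroupMPI shows up in the backtrace below). A minimal sketch of that pattern, assuming one GPU per rank (the actual script is larger and the tensor size here is illustrative):

import torch
import torch.distributed as dist

# Minimal sketch, not the exact workload: MPI backend, one GPU per rank.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Broadcast a CUDA tensor from rank 0; the UCX error below is raised
# from inside MPI_Bcast on a transfer like this one.
t = torch.ones(1024, device="cuda")
dist.broadcast(t, src=0)
torch.cuda.synchronize()
dist.destroy_process_group()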

Error message and stack trace:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0:768657] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f5b05e00000)
==== backtrace (tid: 768657) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5c1eae82b4]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x304af) [0x7f5c1eae84af]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x30796) [0x7f5c1eae8796]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72) [0x7f605c51ca72]
 4  /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93) [0x7f5c1ea959e3]
 5  /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d) [0x7f5c1ebade9d]
 6  /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735) [0x7f5c1ebb9365]
 7  /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab) [0x7f5c2403627b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b) [0x7f605bb4582b]
 9  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2) [0x7f605bb45ed2]
10  /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40) [0x7f5c1deb0840]
11  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41) [0x7f605bb20841]
12  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4) [0x7f600c8168a4]
13  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2) [0x7f600c81d3d2]
14  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5fb34b0253]
15  /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f605c40aac3]
16  /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f605c49ca40]
=================================
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** Process received signal ***
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal: Segmentation fault (11)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal code:  (-6)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Failing at address: 0xbb417
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f605c3b8520]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72)[0x7f605c51ca72]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93)[0x7f5c1ea959e3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d)[0x7f5c1ebade9d]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 4] /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735)[0x7f5c1ebb9365]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab)[0x7f5c2403627b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f605bb4582b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x7f605bb45ed2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 8] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f5c1deb0840]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f605bb20841]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4)[0x7f600c8168a4]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2)[0x7f600c81d3d2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [12] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5fb34b0253]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f605c40aac3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f605c49ca40]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** End of error message ***

Steps to Reproduce

  • Ran mpirun --allow-run-as-root -np 8 python myscript.py
  • UCX version used: 1.15.0
  • UCX configure flags: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --without-java --enable-devel-headers --with-cuda=/usr/local/cuda --with-gdrcopy=/workspace --prefix=/opt/hpcx/ucx
  • Any UCX environment variables used: UCX_TLS = cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp

Setup and versions

  • OS version + CPU architecture
    • Ubuntu 22.04.3 LTS on x86_64 GNU/Linux
  • For GPU related issues:
    • GPU type: H100-80GB
    • Cuda:
      • Drivers version: 535.86.10
      • Check if peer-direct is loaded: The lsmod|grep gdrdrv gives me:
        gdrdrv 24576 0
        nvidia 56512512 523 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

Additional information (depending on the issue)

  • OpenMPI version: 4.1.5rc2
  • Output of ucx_info -d to show transports and devices recognized by UCX:
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: eth0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 1129.60/ppn + 0.00 MB/sec
#              latency: 5258 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 990751528K
#           remote key: 32 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: ep_check
#
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#         memory types: host (reg), cuda (access,alloc,reg,cache,detect), cuda-managed (access,alloc,reg,cache,detect)
#
#      Transport: cuda_copy
#         Device: cuda
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#           memory invalidation is supported
#         memory types: cuda (access,reg,cache)
#
#      Transport: cuda_ipc
#         Device: cuda
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 400000.00/ppn + 0.00 MB/sec
#              latency: 1000 nsec
#             overhead: 7000 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: peer failure, ep_check
# 
@snarayan21 snarayan21 added the Bug label Nov 18, 2023
@brminich
Contributor

Can you please also post the error itself?

@snarayan21
Author

Knew I was forgetting something :) updated the description above!

@snarayan21
Author

Do you think this may be due to using RoCE?

@Akshay-Venkatesh
Contributor

@snarayan21 Can you post the output of ucx_info -v?

Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen with MallocAsync/VMM memory:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context

MallocAsync memory is supported in v1.15.x, but VMM memory is not.
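
One way to check which allocator the tensors come from is sketched below; torch.cuda.get_allocator_backend() is available in recent PyTorch releases and reports whether the stream-ordered (cudaMallocAsync) allocator is active:

import os
import torch

# Reports "native" (default caching allocator backed by cudaMalloc) or
# "cudaMallocAsync" (stream-ordered allocator). This only distinguishes
# the built-in backends; memory coming from a custom/pluggable allocator
# would need to be checked separately.
print(torch.cuda.get_allocator_backend())

# The backend is typically selected through the environment, e.g.
#   PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
print(os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))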

@snarayan21
Author

snarayan21 commented Nov 27, 2023

Here's the output of ucx_info -v:

# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

I'm not entirely sure -- I'm just using the UCC backend with PyTorch, from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
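
For reference, which process group actually handles the collective is decided when the script initializes torch.distributed; a sketch of the two options (the backtrace above goes through ProcessGroupMPI, i.e. the mpi backend):

import torch.distributed as dist

# "mpi" routes collectives through Open MPI (ProcessGroupMPI in the
# backtrace above, ending in MPI_Bcast / mca_pml_ucx), while "ucc"
# calls the UCC library directly. Only one of these is active per run.
dist.init_process_group(backend="mpi")
# dist.init_process_group(backend="ucc")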
