Invalid Device Context and Seg Fault with UCX+MPI+PyTorch #9498

Open
snarayan21 opened this issue Nov 18, 2023 · 5 comments
@snarayan21

snarayan21 commented Nov 18, 2023

Describe the bug

I'm using CUDA-aware OpenMPI that uses UCX (from one of NVIDIA's PyTorch images, which has UCX installed as part of HPC-X) to perform collectives between GPUs. I'm consistently running into the error below and have been unable to solve it. Solutions I have tried:

  • changing the UCX_TLS environment variable
  • explicitly calling torch.cuda.set_device(rank)
  • reinstalling UCX with gdrcopy (though gdrcopy does not seem to be recognized when running ucx_info -d)

I'm not sure what would be going wrong and would greatly appreciate assistance here!
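
For context, the failing collective goes through torch.distributed with the MPI backend (ProcessGroupMPI shows up in the backtrace below). A minimal sketch of that pattern, assuming one GPU per rank (the actual script is larger and the tensor size here is illustrative):

import torch
import torch.distributed as dist

# Minimal sketch, not the exact workload: MPI backend, one GPU per rank.
dist.init_process_group(backend="mpi")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

# Broadcast a CUDA tensor from rank 0; the UCX error below is raised
# from inside MPI_Bcast on a transfer like this one.
t = torch.ones(1024, device="cuda")
dist.broadcast(t, src=0)
torch.cuda.synchronize()
dist.destroy_process_group()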

Error message and stack trace:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0:768657] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x7f5b05e00000)
==== backtrace (tid: 768657) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f5c1eae82b4]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x304af) [0x7f5c1eae84af]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x30796) [0x7f5c1eae8796]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72) [0x7f605c51ca72]
 4  /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93) [0x7f5c1ea959e3]
 5  /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d) [0x7f5c1ebade9d]
 6  /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735) [0x7f5c1ebb9365]
 7  /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab) [0x7f5c2403627b]
 8  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b) [0x7f605bb4582b]
 9  /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2) [0x7f605bb45ed2]
10  /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40) [0x7f5c1deb0840]
11  /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41) [0x7f605bb20841]
12  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4) [0x7f600c8168a4]
13  /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2) [0x7f600c81d3d2]
14  /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f5fb34b0253]
15  /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f605c40aac3]
16  /lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f605c49ca40]
=================================
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** Process received signal ***
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal: Segmentation fault (11)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Signal code:  (-6)
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] Failing at address: 0xbb417
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f605c3b8520]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x1a6a72)[0x7f605c51ca72]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 2] /opt/hpcx/ucx/lib/libuct.so.0(uct_mm_ep_am_short+0x93)[0x7f5c1ea959e3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 3] /opt/hpcx/ucx/lib/libucp.so.0(+0x8ee9d)[0x7f5c1ebade9d]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 4] /opt/hpcx/ucx/lib/libucp.so.0(ucp_tag_send_nbx+0x735)[0x7f5c1ebb9365]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 5] /opt/hpcx/ompi/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0xab)[0x7f5c2403627b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 6] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x14b)[0x7f605bb4582b]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 7] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x7f605bb45ed2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 8] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x7f5c1deb0840]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [ 9] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x7f605bb20841]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [10] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(+0x4eed8a4)[0x7f600c8168a4]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [11] /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cpu.so(_ZN4c10d15ProcessGroupMPI7runLoopEv+0xb2)[0x7f600c81d3d2]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [12] /lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f5fb34b0253]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f605c40aac3]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x126a40)[0x7f605c49ca40]
[e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999] *** End of error message ***

Steps to Reproduce

  • Ran mpirun --allow-run-as-root -np 8 python myscript.py
  • UCX version used: 1.15.0
  • UCX configure flags: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --without-java --enable-devel-headers --with-cuda=/usr/local/cuda --with-gdrcopy=/workspace --prefix=/opt/hpcx/ucx
  • Any UCX environment variables used: UCX_TLS = cma,cuda,cuda_copy,cuda_ipc,mm,posix,self,shm,sm,sysv,tcp

Setup and versions

  • OS version + CPU architecture
    • Ubuntu 22.04.3 LTS on x86_64 GNU/Linux
  • For GPU related issues:
    • GPU type: H100-80GB
    • Cuda:
      • Drivers version: 535.86.10
      • Check if peer-direct is loaded: The lsmod|grep gdrdrv gives me:
        gdrdrv 24576 0
        nvidia 56512512 523 nvidia_uvm,nvidia_peermem,gdrdrv,nvidia_modeset

Additional information (depending on the issue)

  • OpenMPI version: 4.1.5rc2
  • Output of ucx_info -d to show transports and devices recognized by UCX:
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#           rkey_ptr is supported
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: self
#         Device: memory
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 19360.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: tcp
#         Device: eth0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 1129.60/ppn + 0.00 MB/sec
#              latency: 5258 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 990751528K
#           remote key: 32 bytes
#           rkey_ptr is supported
#         memory types: host (access,alloc,cache)
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 15360.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: ep_check
#
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#         memory types: host (reg), cuda (access,alloc,reg,cache,detect), cuda-managed (access,alloc,reg,cache,detect)
#
#      Transport: cuda_copy
#         Device: cuda
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#           memory invalidation is supported
#         memory types: cuda (access,reg,cache)
#
#      Transport: cuda_ipc
#         Device: cuda
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 400000.00/ppn + 0.00 MB/sec
#              latency: 1000 nsec
#             overhead: 7000 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#         memory types: host (access,reg_nonblock,reg,cache)
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: peer failure, ep_check
# 
@snarayan21 snarayan21 added the Bug label Nov 18, 2023
@brminich
Contributor

Can you please also post the error itself?

@snarayan21
Author

Knew I was forgetting something :) updated the description above!

@snarayan21
Author

Do you think this may be due to using RoCE?

@Akshay-Venkatesh
Contributor

@snarayan21 Can you post the output of ucx_info -v?

Is it the case that you're passing cudaMallocAsync memory or CUDA VMM memory to the bcast operation? The following symptom is generally seen with MallocAsync/VMM memory:

[1700266166.002539] [e605721f-97d8-4187-aaec-50f5c08fd75a-0:766999:0]    cuda_copy_md.c:341  UCX  ERROR cuMemGetAddressRange(0x7f5b05e00000) error: invalid device context

MallocAsync memory is supported in v1.15.x, but VMM memory is not.
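
One way to check which allocator the tensors come from is sketched below; torch.cuda.get_allocator_backend() is available in recent PyTorch releases and reports whether the stream-ordered (cudaMallocAsync) allocator is active:

import os
import torch

# Reports "native" (default caching allocator backed by cudaMalloc) or
# "cudaMallocAsync" (stream-ordered allocator). This only distinguishes
# the built-in backends; memory coming from a custom/pluggable allocator
# would need to be checked separately.
print(torch.cuda.get_allocator_backend())

# The backend is typically selected through the environment, e.g.
#   PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync
print(os.environ.get("PYTORCH_CUDA_ALLOC_CONF"))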

@snarayan21
Author

snarayan21 commented Nov 27, 2023

Here's the output of ucx_info -v:

# Library version: 1.15.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision bf8f1b6
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.1.1 --with-gdrcopy --prefix=/build-result/hpcx-v2.16-gcc-inbox-ubuntu22.04-cuda12-gdrcopy2-nccl2.18-x86_64/ucx/mt --with-bfd=/hpc/local/oss/binutils/2.37

I'm not entirely sure -- I'm just using the UCC backend with PyTorch, from the NVIDIA PyTorch images here: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html
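
For reference, which process group actually handles the collective is decided when the script initializes torch.distributed; a sketch of the two options (the backtrace above goes through ProcessGroupMPI, i.e. the mpi backend):

import torch.distributed as dist

# "mpi" routes collectives through Open MPI (ProcessGroupMPI in the
# backtrace above, ending in MPI_Bcast / mca_pml_ucx), while "ucc"
# calls the UCC library directly. Only one of these is active per run.
dist.init_process_group(backend="mpi")
# dist.init_process_group(backend="ucc")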
