CUDA IPC issue #1036

Open · labeled: bug (Something isn't working)
pgrete (Collaborator) opened this issue Mar 27, 2024 · 0 comments

I was made aware of a CUDA IPC-related issue when running a simple AthenaPK test case in parallel.
Given that the test case mostly exercises Parthenon base features, I'm opening the issue here to raise awareness.

The test setup works on one rank and on two ranks but fails on 4 ranks during mesh refinement (sometimes not on the first refinement but on a subsequent one).
Disabling CUDA IPC via

export UCX_TLS=rc_x,self,sm,gdr_copy,cuda_copy

i.e., removing cuda_ipc from the transport list, also works around the issue.

It might be worth trying to reproduce this with the Parthenon advection example (see the sketch below) and on other software stacks.
Realistically, I'll only be able to take a closer look in about two weeks, so if anyone has ideas in the meantime, go for it! I was not able to reproduce the issue on Frontier, so it might be a CUDA-specific problem. Ping @forrestglines
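
A minimal sketch for running the Parthenon advection example (the binary and input-file paths are assumptions based on a default Parthenon build tree, so adjust to your checkout):

$ srun -n 4 /path/to/parthenon/build/example/advection/advection-example -i /path/to/parthenon/example/advection/parthinput.advection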

Steps to reproduce (using NVHPC/CUDA/OpenMPI 23.1/11.7.64/4.1.4 and 23.7/12.2.91/4.1.5 on A100s on JUWELS Booster): compile current AthenaPK main and run the advection pgen.
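
A minimal build sketch (the clone URL is the upstream AthenaPK repository; the Kokkos options are assumptions targeting A100s, not necessarily the exact configuration used above):

# assumed build steps; load matching NVHPC/CUDA/OpenMPI modules first
$ git clone --recursive https://github.com/parthenon-hpc-lab/athenapk.git
$ cd athenapk
$ cmake -S . -B build-fresh -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_AMPERE80=ON
$ cmake --build build-fresh -j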

# works
$ srun -n 1 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in 

# works
$ srun -n 2 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in

# FAILS after mesh refinement
$ srun -n 4 ../build-fresh/bin/athenaPK -i ../inputs/advection_3d.in
cycle=51 time=5.2174263355214599e-02 dt=1.0230267049312646e-03 zone-cycles/wsec_step=7.38e+05 wsec_total=2.18e+00 wsec_step=1.22e-01 zone-cycles/wsec=2.36e+05 wsec_AMR=2.61e-01
-------------- New Mesh structure after (de)refinement -------------
Root grid = 4 x 4 x 4 MeshBlocks
Total number of MeshBlocks = 281
Number of physical refinement levels = 2
Number of logical  refinement levels = 4
  Physical level = 0 (logical level = 2): 44 MeshBlocks, cost = 44
  Physical level = 1 (logical level = 3): 149 MeshBlocks, cost = 149
  Physical level = 2 (logical level = 4): 88 MeshBlocks, cost = 88
--------------------------------------------------------------------
[1711532903.925045] [jwb0033:30564:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[... the same cuCtxGetDevice "invalid device context" UCX ERROR repeated 25 more times by pid 30564 ...]
[jwb0033.juwels:00000] *** An error occurred in MPI_Test
[jwb0033.juwels:00000] *** reported by process [2349416960,2]
[jwb0033.juwels:00000] *** on communicator MPI COMMUNICATOR 4 DUP FROM 0
[jwb0033.juwels:00000] *** MPI_ERR_INTERN: internal error
[jwb0033.juwels:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[jwb0033.juwels:00000] ***    and potentially your MPI job)
[jwb0033.juwels:30564] [0] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libopen-pal.so.40(opal_backtrace_buffer+0x25) [0x14fac811eb25]
[jwb0033.juwels:30564] [1] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_abort+0x8e) [0x14facb63604e]
[jwb0033.juwels:30564] [2] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_mpi_errors_are_fatal_comm_handler+0x36f) [0x14facb6243ef]
[jwb0033.juwels:30564] [3] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(ompi_errhandler_request_invoke+0x6f0) [0x14facb623f30]
[jwb0033.juwels:30564] [4] func:/p/software/juwelsbooster/stages/2024/software/OpenMPI/4.1.5-NVHPC-23.7-CUDA-12/lib/libmpi.so.40(PMPI_Test+0xb9) [0x14facb670879]
[jwb0033.juwels:30564] [5] func:../build-fresh-24/bin/athenaPK() [0x97c43d]
[jwb0033.juwels:30564] [6] func:../build-fresh-24/bin/athenaPK() [0x4b9bca]
[jwb0033.juwels:30564] [7] func:../build-fresh-24/bin/athenaPK() [0x4b4efa]
[jwb0033.juwels:30564] [8] func:../build-fresh-24/bin/athenaPK() [0x4b489b]
[jwb0033.juwels:30564] [9] func:/usr/lib64/libpthread.so.0(+0xfe67) [0x14facae03e67]
[jwb0033.juwels:30564] [10] func:../build-fresh-24/bin/athenaPK() [0x4b576e]
[jwb0033.juwels:30564] [11] func:../build-fresh-24/bin/athenaPK() [0x4b8a13]
[jwb0033.juwels:30564] [12] func:/p/software/juwelsbooster/stages/2024/software/GCCcore/12.3.0/lib64/libstdc++.so.6(+0xe0a93) [0x14fac898aa93]
[jwb0033.juwels:30564] [13] func:/usr/lib64/libpthread.so.0(+0x81ca) [0x14facadfc1ca]
[jwb0033.juwels:30564] [14] func:/usr/lib64/libc.so.6(clone+0x43) [0x14fac819ce73]
[1711532903.927606] [jwb0033:30662:0]  cuda_ipc_cache.c:158  UCX  ERROR cuCtxGetDevice(&key.cu_device)() failed: invalid device context
[... a second rank (pid 30662) emits the same repeated UCX ERROR lines, MPI_Test MPI_ERR_INTERN abort, and backtrace as above ...]
