GPU Aware Open MPI 5.0.1 + ROCm gives UCX ERROR: failed to register address #9589
Comments
@denisbertini thank you for the report. What GPU is this, if I may ask? Also, how difficult is it to set up the application/testcase for reproducing?
The GPU used is the AMD MI100 and the OS is Rocky Linux 8.8.
@denisbertini is there an easy way to run a simple test that reproduces the issue?
I tried with the GPU-aware OSU benchmarks, but it all works fine, so it seems to be related to the AMReX code itself.
What is your IOMMU setting, if I may ask? And one other thing that comes to mind is whether ACS is disabled?
Which command(s) should I use to get this info from UCX?
For the first one, try `cat /proc/cmdline | grep iommu`
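For reference, a minimal way to check both settings on a node (assuming `sudo` access for `lspci`; the exact kernel parameters and device list vary by platform) could be:

```bash
# IOMMU setting passed on the kernel command line
cat /proc/cmdline | grep -i iommu

# PCI ACS status per device: capabilities followed by '-' are disabled,
# '+' means that ACS feature is enabled on the bridge/device
sudo lspci -vvv | grep -i acsctl
```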
IOMMU is now enabled in both the BIOS and the GRUB boot manager on 2 of our GPU nodes.
With debug output:
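(The attached log is not reproduced above; for reference, UCX debug output of this kind is usually obtained by raising the UCX log level for the run via the standard `UCX_LOG_LEVEL` variable:)

```bash
# Make UCX print debug-level messages for every rank of the job
export UCX_LOG_LEVEL=debug
# Open MPI forwards the already-set variable to remote ranks with -x
mpirun -np 8 -x UCX_LOG_LEVEL ./app args   # './app args' is a placeholder
```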
@denisbertini what about the PCI ACS, is that also disabled on the nodes? I am 99% confident that this is a system setup issue, not a UCX issue, since we are running the same software configuration daily in our internal setup; we'll just have to identify what is triggering the problem.
@edgargabriel Answer from our sys. admin colleague:
Do you have a script/recipe that I could use to reproduce the run on one of our internal systems? E.g. how to compile and run the code, what input files are required, etc.
What MOFED version are you running, btw, on that system?
We do not use the official Mellanox MOFED but the Linux rdma-core library.
Could you in that case check whether the ib_peer_mem kernel module is running/used?
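A quick check along those lines (assuming standard module naming; the sysfs path below only exists with the MOFED peer-memory framework) might be:

```bash
# Is a peer-memory module (e.g. ib_peer_mem) loaded?
lsmod | grep -i peer_mem

# With MOFED, registered peer-memory clients show up here;
# the directory is absent on plain rdma-core installations
ls /sys/kernel/mm/memory_peers/ 2>/dev/null
```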
Not simple to reproduce, but you can give it a try:
After these steps you should have all 3 executables; after that you can run a simulation:
The inputs_3d.txt is in the attachment.
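(The build recipe and executable names themselves are not preserved above; purely as an illustrative sketch with a hypothetical executable name, a GPU-aware run of the case could look like:)

```bash
# Hypothetical launch: 8 ranks, UCX PML, one MI100 per rank, inputs_3d.txt as input deck
mpirun -np 8 --mca pml ucx ./main3d.hip.MPI.ex inputs_3d.txt
```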
@denisbertini thank you! I will give it a try, but it might take me a few days until I get to it.
I can understand that!
@denisbertini I have successfully compiled the application, but I have trouble running it. The system here does not have singularity installed, but independent of that I get the following error when starting the application:
Do I need to compile/provide HDF5 as well for the app?
Please try again with this new input file:
OK, this worked now, thanks! I ran it on 2 nodes, each node with 4 MI100 GPUs + InfiniBand, and it seemed to work without issues (UCX 1.15.0, Open MPI 5.0.1, but ROCm 5.7.1 and MOFED installed on these nodes). I will try to run tomorrow on 2 nodes with 8 GPUs each (since that changes the underlying topology).
Could you try with ROCm 6.0?
Unfortunately not, I cannot update the ROCm version on that cluster, that is not under my control. However, the ROCm version plays a minuscule role in this scenario in my opinion. If the 8 GPUs per node scenario with 2 nodes (16 GPUs) also works, my main suspicion is actually on the MOFED vs. rdma-core library. The managers of this cluster also originally tried the rdma-core library, but had to switch to MOFED.
I can confirm that the 16 process run (2 nodes with 8 processes/GPUs each of MI100) also finished correctly (though the job complained about too many resources and too little work :-) ). I am at this point 99% sure that this is not a UCX/Open MPI issue, but that we are dealing with some configuration aspect on your cluster.
Could you ask the sys admin of your cluster the reason(s) why they had to move from rdma-core to the official MOFED library?
It was because of issues making GPUDirect RDMA work. We meanwhile have GPUDirect RDMA using rdma-core working on a cluster that uses Broadcom NICs, but for InfiniBand/Mellanox RoCE HCAs we always use MOFED.
Could you please try the test again with the modified submit script:
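(The modified submit script is not preserved here; purely as an illustration of the kind of settings such a script typically changes, a SLURM-style launch with an explicit UCX transport selection might look like the sketch below, where the transport list and executable name are assumptions:)

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

# Assumed transport list: host transports plus the ROCm copy/IPC transports
export UCX_TLS=rc,sm,self,rocm_copy,rocm_ipc
srun ./main3d.hip.MPI.ex inputs_3d.txt   # hypothetical executable name
```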
@edgargabriel
This issue also seems to be linked to the upstreamed OFED drivers in RHEL. See for example this related issue:
@denisbertini I submitted a job with the new syntax, will update the ticket once the job finishes. The issue that you are pointing to was a problem that we had in UCX 1.13.x, but it was resolved starting with UCX 1.14.x, so it should hopefully not apply here (unless your job accidentally pulls in a wrong UCX library, e.g. because LD_LIBRARY_PATH is not set to point to ucx 1.15/lib or similar).
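A simple way to verify which UCX installation the environment actually resolves (assuming `ucx_info` from the intended build is in `PATH`) is:

```bash
# Version of the UCX library found first in the current environment
ucx_info -v

# Confirm that Open MPI's UCX component links against the intended libucp
# (adjust the Open MPI installation path for your system)
ldd /path/to/openmpi/lib/openmpi/mca_pml_ucx.so | grep ucp
```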
I reran the code with the changed settings/arguments; the code still finished correctly on our system.
OK, then it is now clear ... thanks!
Standard OSU benchmarks test
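For reference, the device-buffer variant of the OSU bandwidth test is typically invoked along these lines (assuming the benchmarks were built with ROCm support; `D D` selects device memory on both the send and receive side):

```bash
# Two ranks on two different nodes, GPU-to-GPU transfer through ROCm device buffers
mpirun -np 2 --map-by node ./osu_bw -d rocm D D
```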
OK, so very clearly memory registration of GPU memory doesn't work on your system, independent of the application. This leads to the things that I suggested and also mentioned in the email communication:
I think these are your two options.
When using [AMReX code](https://amrex-codes.github.io/amrex) with
the program crashes immediately with an invalid buffer size:
Intra-node communication works, though.
Any ideas what can be wrong?
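(A quick sanity check for this kind of registration failure, assuming `ucx_info` is available on a compute node, is to confirm that the UCX build advertises ROCm memory support at all:)

```bash
# List UCX transports/memory domains and filter for ROCm entries;
# no output usually means UCX was built without ROCm support
ucx_info -d | grep -i rocm
```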