osu_mbw_mr for CUDA memory shows bad performance with UCX_PROTO_ENABLE=y #9690

dmitrygx opened this issue Feb 15, 2024 · 0 comments

Describe the bug

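  1. osu_mbw_mr D D shows bad performance with UCX_PROTO_ENABLE=y compared to UCX_PROTO_ENABLE=n: for messages of 128 KB and larger, bandwidth stays at roughly 18,600-20,400 MB/s instead of 95,100-98,700 MB/s.
    Bad (UCX_PROTO_ENABLE=y):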
yt_slot_7@vla2-1174-gpu-node-hahn:~$ mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE --report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16 -x UCX_PROTO_ENABLE=y -x UCX_TLS=rc,cuda_copy /opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01694] Warning: could not find environment variable "HPC_WORKSPACE"
Warning: Permanently added '[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net]:8089,[2a02:6b8:c1d:6e8d:10d:e7ae:bcd7:0]:8089' (RSA) to the list of known hosts.
Warning: Permanently added '[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net]:8087,[2a02:6b8:c1d:6d91:10d:e7ae:5e93:0]:8087' (RSA) to the list of known hosts.
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01634] MCW rank 9 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01633] MCW rank 8 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01716] MCW rank 5 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01713] MCW rank 2 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01711] MCW rank 0 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01717] MCW rank 6 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01715] MCW rank 4 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01636] MCW rank 11 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01637] MCW rank 12 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01638] MCW rank 13 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01639] MCW rank 14 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01714] MCW rank 3 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01712] MCW rank 1 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01718] MCW rank 7 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01635] MCW rank 10 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01640] MCW rank 15 is not bound (or bound to all available processors)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# [ pairs: 8 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
# Datatype: MPI_CHAR.
1                       1.30        1301272.82
2                       2.59        1292791.72
4                       5.23        1307511.58
8                      10.37        1295688.95
16                     20.81        1300802.14
32                     41.28        1289854.91
64                     83.16        1299317.21
128                   165.58        1293585.26
256                   718.29        2805839.37
512                  1406.10        2746282.98
1024                 2843.87        2777215.61
2048                 5729.96        2797829.95
4096                10108.80        2467968.94
8192                16898.19        2062767.02
16384               18847.49        1150359.41
32768               19343.57         590319.06
65536               19699.38         300588.63
131072              18589.76         141828.64
262144              18676.20          71244.04
524288              19973.50          38096.43
1048576             20399.55          19454.53
2097152             19899.64           9488.89
4194304             19865.26           4736.25

Good (UCX_PROTO_ENABLE=n; see the results for 128 KB and larger):

yt_slot_7@vla2-1174-gpu-node-hahn:~$ mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE --report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16 -x UCX_PROTO_ENABLE=n -x UCX_TLS=rc,cuda_copy  /opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01767] Warning: could not find environment variable "HPC_WORKSPACE"
Warning: Permanently added '[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net]:8087,[2a02:6b8:c1d:6d91:10d:e7ae:5e93:0]:8087' (RSA) to the list of known hosts.
Warning: Permanently added '[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net]:8089,[2a02:6b8:c1d:6e8d:10d:e7ae:bcd7:0]:8089' (RSA) to the list of known hosts.
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01700] MCW rank 8 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01702] MCW rank 10 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01704] MCW rank 12 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01703] MCW rank 11 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01706] MCW rank 14 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01701] MCW rank 9 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01705] MCW rank 13 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01707] MCW rank 15 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01785] MCW rank 1 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01790] MCW rank 6 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01789] MCW rank 5 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01786] MCW rank 2 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01787] MCW rank 3 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01784] MCW rank 0 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01788] MCW rank 4 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01791] MCW rank 7 is not bound (or bound to all available processors)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# [ pairs: 8 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
# Datatype: MPI_CHAR.
1                       0.65         654098.23
2                       1.47         734891.91
4                       2.91         726428.07
8                       5.87         734064.96
16                     11.67         729143.99
32                     23.45         732678.31
64                     45.97         718281.94
128                    91.89         717865.58
256                   183.69         717543.84
512                   369.53         721732.15
1024                  726.79         709751.45
2048                 1422.15         694410.14
4096                 2778.02         678226.51
8192                 5499.50         671325.74
16384                5514.58         336583.34
32768                5610.06         171205.37
65536                5646.15          86153.46
131072              95134.90         725821.70
262144              96816.85         369326.97
524288              97889.50         186709.40
1048576             98319.89          93765.15
2097152             98569.71          47001.70
4194304             98699.50          23531.79
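
To make the gap between the two runs easier to read, the tables can be compared row by row. The following is a minimal sketch, not part of the original report: the helper names `parse_osu` and `bandwidth_ratio` are hypothetical, and the embedded rows are excerpts copied from the two outputs above.

```python
def parse_osu(text):
    """Parse osu_mbw_mr output into {size_bytes: (MB/s, messages/s)}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly three fields; header lines start with '#'.
        if len(parts) != 3 or line.lstrip().startswith("#"):
            continue
        try:
            rows[int(parts[0])] = (float(parts[1]), float(parts[2]))
        except ValueError:
            continue  # not a table row (e.g. an mpirun warning)
    return rows


# Excerpts copied from the UCX_PROTO_ENABLE=y run above.
PROTO_Y = """
# Size                  MB/s        Messages/s
8192                16898.19        2062767.02
65536               19699.38         300588.63
131072              18589.76         141828.64
4194304             19865.26           4736.25
"""

# Excerpts copied from the UCX_PROTO_ENABLE=n run above.
PROTO_N = """
# Size                  MB/s        Messages/s
8192                 5499.50         671325.74
65536                5646.15          86153.46
131072              95134.90         725821.70
4194304             98699.50          23531.79
"""


def bandwidth_ratio(a, b):
    """Return {size: bandwidth_a / bandwidth_b} for sizes present in both runs."""
    return {size: a[size][0] / b[size][0] for size in sorted(a.keys() & b.keys())}


ratios = bandwidth_ratio(parse_osu(PROTO_Y), parse_osu(PROTO_N))
for size, ratio in ratios.items():
    print(f"{size:>8} B  proto=y / proto=n bandwidth = {ratio:.2f}")
```

A ratio below 1 means UCX_PROTO_ENABLE=y is slower at that size. Pasting the full outputs into the two strings works as well, since non-table lines are skipped.
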
  2. Thresholds are selected suboptimally with UCX_PROTO_ENABLE=n: setting UCX_RNDV_THRESH=0 raises bandwidth for messages up to 64 KB (for example, 89499.37 MB/s instead of 5646.15 MB/s at 64 KB).

UCX_RNDV_THRESH default:

yt_slot_7@vla2-1174-gpu-node-hahn:~$ mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE --report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16 -x UCX_PROTO_ENABLE=n -x UCX_TLS=rc,cuda_copy /opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01767] Warning: could not find environment variable "HPC_WORKSPACE"
Warning: Permanently added '[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net]:8087,[2a02:6b8:c1d:6d91:10d:e7ae:5e93:0]:8087' (RSA) to the list of known hosts.
Warning: Permanently added '[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net]:8089,[2a02:6b8:c1d:6e8d:10d:e7ae:bcd7:0]:8089' (RSA) to the list of known hosts.
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01700] MCW rank 8 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01702] MCW rank 10 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01704] MCW rank 12 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01703] MCW rank 11 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01706] MCW rank 14 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01701] MCW rank 9 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01705] MCW rank 13 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01707] MCW rank 15 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01785] MCW rank 1 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01790] MCW rank 6 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01789] MCW rank 5 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01786] MCW rank 2 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01787] MCW rank 3 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01784] MCW rank 0 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01788] MCW rank 4 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01791] MCW rank 7 is not bound (or bound to all available processors)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# [ pairs: 8 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
# Datatype: MPI_CHAR.
1                       0.65         654098.23
2                       1.47         734891.91
4                       2.91         726428.07
8                       5.87         734064.96
16                     11.67         729143.99
32                     23.45         732678.31
64                     45.97         718281.94
128                    91.89         717865.58
256                   183.69         717543.84
512                   369.53         721732.15
1024                  726.79         709751.45
2048                 1422.15         694410.14
4096                 2778.02         678226.51
8192                 5499.50         671325.74
16384                5514.58         336583.34
32768                5610.06         171205.37
65536                5646.15          86153.46
131072              95134.90         725821.70
262144              96816.85         369326.97
524288              97889.50         186709.40
1048576             98319.89          93765.15
2097152             98569.71          47001.70
4194304             98699.50          23531.79

UCX_RNDV_THRESH=0:

yt_slot_7@vla2-1174-gpu-node-hahn:~$ mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE --report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16 -x UCX_PROTO_ENABLE=n -x UCX_TLS=rc,cuda_copy -x UCX_RNDV_THRESH=0 /opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01840] Warning: could not find environment variable "HPC_WORKSPACE"
Warning: Permanently added '[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net]:8089,[2a02:6b8:c1d:6e8d:10d:e7ae:bcd7:0]:8089' (RSA) to the list of known hosts.
Warning: Permanently added '[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net]:8087,[2a02:6b8:c1d:6d91:10d:e7ae:5e93:0]:8087' (RSA) to the list of known hosts.
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01768] MCW rank 9 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01863] MCW rank 6 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01769] MCW rank 10 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01772] MCW rank 13 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01771] MCW rank 12 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01859] MCW rank 2 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01773] MCW rank 14 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01770] MCW rank 11 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01774] MCW rank 15 is not bound (or bound to all available processors)
[vla2-0693-gpu-node-hahn.vla.yp-c.yandex.net:01767] MCW rank 8 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01860] MCW rank 3 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01864] MCW rank 7 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01858] MCW rank 1 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01862] MCW rank 5 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01861] MCW rank 4 is not bound (or bound to all available processors)
[vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net:01857] MCW rank 0 is not bound (or bound to all available processors)
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# OSU MPI Multiple Bandwidth / Message Rate Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)

# [ pairs: 8 ] [ window size: 64 ]
# Size                  MB/s        Messages/s
# Datatype: MPI_CHAR.
1                       1.09        1087855.69
2                       2.18        1090733.52
4                       4.34        1083929.55
8                       8.71        1088492.09
16                     17.37        1085410.99
32                     34.89        1090425.40
64                     69.67        1088545.76
128                   715.01        5585980.85
256                  1420.29        5548017.93
512                  2851.20        5568754.76
1024                 5727.37        5593134.43
2048                11417.72        5575059.19
4096                21845.33        5333331.67
8192                45847.66        5596638.26
16384               70479.59        4301732.96
32768               82019.35        2503031.99
65536               89499.37        1365652.02
131072              94064.08         717651.99
262144              96358.34         367577.91
524288              97573.52         186106.72
1048576             98179.40          93631.18
2097152             98501.57          46969.21
4194304             98661.19          23522.66
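
One way to see where the protocol appears to switch in each run is to look for a sharp rise in message rate between consecutive sizes. The sketch below is a heuristic, not something UCX reports: the function name `rate_jumps` and the 1.5x factor are assumptions, and the rows are consecutive excerpts from the two UCX_PROTO_ENABLE=n outputs above.

```python
def rate_jumps(rows, factor=1.5):
    """Return sizes whose message rate exceeds the previous row's by `factor`.

    `rows` is a list of (size_bytes, messages_per_s) for consecutive sizes.
    Message rate normally stays flat or falls as the size grows, so a sharp
    rise hints that a different protocol took over at that size.
    """
    return [
        size
        for (_, prev_rate), (size, rate) in zip(rows, rows[1:])
        if rate > factor * prev_rate
    ]


# Consecutive rows from the run with the default UCX_RNDV_THRESH.
DEFAULT_THRESH = [
    (32768, 171205.37),
    (65536, 86153.46),
    (131072, 725821.70),
    (262144, 369326.97),
]

# Consecutive rows from the run with UCX_RNDV_THRESH=0.
ZERO_THRESH = [
    (32, 1090425.40),
    (64, 1088545.76),
    (128, 5585980.85),
    (256, 5548017.93),
]

print("default UCX_RNDV_THRESH:", rate_jumps(DEFAULT_THRESH))
print("UCX_RNDV_THRESH=0:      ", rate_jumps(ZERO_THRESH))
```

On these excerpts the jump lands at 131072 bytes with the default threshold and at 128 bytes with UCX_RNDV_THRESH=0, which matches the sizes where the bandwidth columns change regime.
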

Steps to Reproduce

  • Command line
mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root --bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth -mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 -mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE --report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16 -x UCX_RNDV_THRESH=0 -x UCX_TLS=rc,cuda_copy /opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D

The host file contains two hosts with 8 slots each.
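
The three configurations compared in this report differ only in the UCX variables passed with -x, so their command lines can be generated from one template. This is a sketch under that assumption: `LAUNCHER`, `BENCHMARK`, `VARIANTS` and `build_argv` are hypothetical names, and every option is copied from the commands above. The script only prints the commands; it does not run them.

```python
import shlex

# Launcher options copied from the command line above, UCX settings factored out.
LAUNCHER = (
    "mpirun --prefix /opt/hpcx/ompi-ipv6 -x LD_LIBRARY_PATH --allow-run-as-root "
    "--bind-to none -mca orte_keep_fqdn_hostnames true -mca oob_tcp_if_include veth "
    "-mca plm_rsh_num_concurrent 300 -mca routed_radix 600 -mca plm_rsh_no_tree_spawn 1 "
    "-mca pmix_base_async_modex 1 --mca btl ^openib -mca pml ucx -x HPC_WORKSPACE "
    "--report-bindings -hostfile /slot/sandbox/mpi_hosts.txt -np 16"
)
BENCHMARK = "/opt/osu-micro-benchmarks/install/mpi/pt2pt/osu_mbw_mr D D"

# UCX settings of the three runs shown in this report.
VARIANTS = {
    "proto_y": {"UCX_PROTO_ENABLE": "y", "UCX_TLS": "rc,cuda_copy"},
    "proto_n": {"UCX_PROTO_ENABLE": "n", "UCX_TLS": "rc,cuda_copy"},
    "proto_n_rndv0": {
        "UCX_PROTO_ENABLE": "n",
        "UCX_TLS": "rc,cuda_copy",
        "UCX_RNDV_THRESH": "0",
    },
}


def build_argv(ucx_env):
    """Return the mpirun argument list with each UCX variable exported via -x."""
    argv = shlex.split(LAUNCHER)
    for name, value in ucx_env.items():
        argv += ["-x", f"{name}={value}"]
    return argv + shlex.split(BENCHMARK)


for label, env in VARIANTS.items():
    print(f"# {label}")
    print(shlex.join(build_argv(env)))
```

Each printed line can be pasted into a shell on the launch node; the quoting added by `shlex.join` around `^openib` is harmless.
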

  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
yt_slot_3@sas8-6443-gpu-node-hahn:~$ ucx_info -v
# Library version: 1.16.0
# Library path: /opt/hpcx/ucx/lib/libucs.so.0
# API headers version: 1.16.0
# Git branch '', revision 687e41b
# Configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --without-knem --with-xpmem=/hpc/local/oss/xpmem/v2.7.1 --without-java --enable-devel-headers --with-fuse3-static --with-cuda=/hpc/local/oss/cuda12.2.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.17-gcc-mlnx_ofed-ubuntu22.04-cuda12-x86_64/ucx --with-bfd=/hpc/local/oss/binutils/2.37
  • Any UCX environment variables used
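    None apart from those passed with -x on the command lines above (UCX_PROTO_ENABLE, UCX_TLS, UCX_RNDV_THRESH).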

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • cat /etc/issue or cat /etc/redhat-release + uname -a
yt_slot_7@vla2-1174-gpu-node-hahn:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal
yt_slot_7@vla2-1174-gpu-node-hahn:~$ uname -a
Linux vla2-1174-gpu-node-hahn.vla.yp-c.yandex.net 5.4.210-39.1.bert #1 SMP Fri Apr 7 11:03:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • rpm -q rdma-core or rpm -q libibverbs
      • or: MLNX_OFED version ofed_info -s
    • HW information from ibstat or ibv_devinfo -vv command
  • For GPU related issues:
    • GPU type
    • Cuda:
      • Drivers version
yt_slot_7@vla2-1174-gpu-node-hahn:~$ nvidia-smi | grep -i driver
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
      • Check if peer-direct is loaded: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
yt_slot_7@vla2-1174-gpu-node-hahn:~$ lsmod|grep nv_peer_mem
nv_peer_mem            16384  0
nvidia              56152064  481 nvidia_uvm,nv_peer_mem,nvidia_modeset
ib_core               389120  8 rdma_cm,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
yt_slot_7@vla2-1174-gpu-node-hahn:~$ lsmod|grep gdrdrv
<empty>

Additional information (depending on the issue)

  • OpenMPI version
yt_slot_7@vla2-1174-gpu-node-hahn:~$ ompi_info --version
Open MPI v4.1.5rc2

http://www.open-mpi.org/community/help/
  • Output of ucx_info -d to show transports and devices recognized by UCX
yt_slot_7@vla2-1174-gpu-node-hahn:~$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.31.1014
        node_guid:                      043f:7203:00ef:6394
        sys_image_guid:                 043f:7203:00ef:6394
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               527
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         20.31.1014
        node_guid:                      043f:7203:00ef:60c8
        sys_image_guid:                 043f:7203:00ef:60c8
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               514
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         20.31.1014
        node_guid:                      043f:7203:00ef:639c
        sys_image_guid:                 043f:7203:00ef:639c
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               520
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         20.31.1014
        node_guid:                      043f:7203:00ef:6108
        sys_image_guid:                 043f:7203:00ef:6108
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 7
                        port_lid:               511
                        port_lmc:               0x00
                        link_layer:             InfiniBand
  • Configure result - config.log
    See ucx_info -v above
dmitrygx added the Bug label Feb 15, 2024