I could not test the fix upstream because the build checks failed due to the kernel version on the runners.
That still needs a solution. Apart from that, the fix adds packages that made the bootc image work with Gaudi 2 hardware, so I guess this issue can be closed now.
The container file https://github.com/containers/ai-lab-recipes/blob/main/training/intel-bootc/Containerfile does not contain the pieces needed to set up InfiniBand for Intel Gaudi devices. Without IB, it is not possible to use Intel Gaudi 2 cards for training. I'm not entirely sure if this only affects servers with multiple Intel Gaudi 2 cards or also servers with a single card. My test systems have 8 Intel Gaudi 2 cards each.
The server with the bootc image did not have the `habanalabs_ib` module loaded. Manually loading the module doesn't make a difference. `rdma dev` does not show any `hlib` (Habana Labs InfiniBand) devices. Without the devices, PyTorch and Habana's PyTorch plugin `habana_frameworks` fail to initialize the devices: `hcl_ibverbs_t::init failed to find matching Habana IB device (hlib_6)`
On the other bare metal server with regular RHEL 9 and Habana's packages, `rdma dev` shows a node for each card and `rdma link` shows over 160 active connections with LINK_UP.

```
# rdma dev
0: mlx5_0: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:5242 sys_image_guid b83f:d203:004a:5242
1: mlx5_1: node_type ca fw 20.32.2004 node_guid b83f:d203:004a:522e sys_image_guid b83f:d203:004a:522e
18: hlib_4: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
19: hlib_6: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
20: hlib_2: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
21: hlib_7: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
22: hlib_1: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
23: hlib_3: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
24: hlib_5: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
25: hlib_0: node_type unspecified fw 49.0.0 node_guid 0000:0000:0000:0000 sys_image_guid 0000:0000:0000:0000
```
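For comparing a bootc host against the working RHEL 9 host, counting the `hlib_*` entries in `rdma dev` output may be quicker than reading the list by eye. The sketch below is only an illustration: the function names `count_hlib` and `check_hlib` are made up for this example, and the line pattern is an assumption based on the output format shown in this issue.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any Habana or iproute2 tooling.

# Count the hlib_* entries in `rdma dev` output read from stdin.
count_hlib() {
    # grep -c exits non-zero when the count is 0; `|| true` keeps the
    # function usable under `set -e` while still printing "0".
    grep -cE '^[0-9]+: hlib_[0-9]+:' || true
}

# Succeed only if at least the expected number of hlib devices is listed.
# $1 is the expected card count (8 on the test systems in this issue).
check_hlib() {
    expected=$1
    found=$(count_hlib)
    echo "found $found hlib device(s), expected $expected"
    [ "$found" -ge "$expected" ]
}
```

On a host with the `rdma` tool installed this could be run as `rdma dev | check_hlib 8`; a non-zero exit status would point at missing IB devices like the ones described above.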
The journal shows that the `habanalabs` kernel module registers a Habana Labs InfiniBand device for each card:

```
# journalctl -o cat | grep hlib
habanalabs 0000:db:00.0 hlib_4: IB device registered
habanalabs 0000:bb:00.0 hlib_2: IB device registered
habanalabs 0000:19:00.0 hlib_6: IB device registered
habanalabs 0000:5d:00.0 hlib_7: IB device registered
habanalabs 0000:9b:00.0 hlib_1: IB device registered
habanalabs 0000:cb:00.0 hlib_3: IB device registered
habanalabs 0000:4c:00.0 hlib_5: IB device registered
habanalabs 0000:3b:00.0 hlib_0: IB device registered
```
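To see which PCI address backs a given `hlib_*` device (for example the `hlib_6` named in the error message), the journal lines can be reduced to a sorted device-to-address list. This is a rough sketch: `hlib_pci_map` is a hypothetical name, and the field layout is assumed from the journal lines quoted in this issue, so it may need adjusting for other driver versions.

```shell
#!/bin/sh
# Hypothetical helper: read journal lines on stdin and print
# "<hlib device> <PCI address>" pairs, sorted by device name.
hlib_pci_map() {
    awk '$1 == "habanalabs" && $3 ~ /^hlib_[0-9]+:$/ && /IB device registered/ {
        sub(/:$/, "", $3)   # drop the trailing colon from "hlib_N:"
        print $3, $2
    }' | sort
}
```

Assuming the same journal format, `journalctl -o cat | hlib_pci_map` would print one line per registered card; an empty result on the bootc host would match the missing devices reported here.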
According to https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html we also need the `habanalabs-thunk` and `habanalabs-rdma-core` packages in the bootc image.

reproducer