Does Ray AIR support InfiniBand networking?

Originally, my training code ran over an InfiniBand network using MPI + DeepSpeed, and the speedup was close to ideal. I tried migrating to Ray AIR because I value its fault tolerance. After the migration, I found that Ray AIR was not using InfiniBand; it fell back to the Ethernet network. When training a 33B model this way, training speed dropped sharply compared with my original code, to only about 10% of the original.
Here is the code I modified:

import ray
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

# Use NCCL as the distributed backend for PyTorch.
torchConfig = TorchConfig(backend="nccl")
ray.init(address="auto")
trainer = TorchTrainer(
    train_loop_per_worker=main,
    train_loop_config={"hparam": hparam},
    torch_config=torchConfig,
    scaling_config=ScalingConfig(
        num_workers=32, use_gpu=True, resources_per_worker={"CPU": 8}
    ),
)
result = trainer.fit()

Is it true that Ray AIR does not support InfiniBand, or is my configuration wrong?

Hey @shaojun_li, AIR places no restrictions on the network types NCCL can use. Could you check your NCCL configuration? For example, have you set the NCCL_SOCKET_IFNAME env var on your head node?

Some references:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-socket-ifname

https://docs.ray.io/en/latest/train/faq.html#my-multi-node-pytorch-gpu-training-is-hanging-or-giving-me-obscure-nccl-errors-what-do-i-do
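For reference, here is a minimal sketch of forwarding NCCL env vars to all Ray worker processes through a runtime environment. The interface name ib0 is just a placeholder; substitute whatever your IB-capable interface is actually called:

import ray

# Sketch: forward NCCL settings to every Ray worker.
# "ib0" is a placeholder interface name; adjust for your cluster.
ray.init(
    address="auto",
    runtime_env={
        "env_vars": {
            "NCCL_SOCKET_IFNAME": "ib0",
            "NCCL_DEBUG": "INFO",
        }
    },
)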

Also, could you elaborate on how you set up the InfiniBand environment without Ray? I'm happy to help replicate the issue on our end :)


Thank you for your reply!

These are the environment variables I configured:

from ray.runtime_env import RuntimeEnv

hparam = get_args()
# Forward the NCCL/IB settings to every Ray worker process.
runtime_env = RuntimeEnv(
    env_vars={
        "NCCL_IB_GID_INDEX": "3",
        "NCCL_IB_HCA": "mlx5_14",
        "NCCL_IB_DISABLE": "0",
        "NCCL_DEBUG": "INFO",
        "OMPI_MCA_btl_tcp_if_include": "eth0",
        "NCCL_SOCKET_IFNAME": "eth0",
    }
)
torchConfig = TorchConfig(backend="nccl")
ray.init(address="auto", runtime_env=runtime_env)
trainer = TorchTrainer(
    train_loop_per_worker=main,
    train_loop_config={"hparam": hparam},
    torch_config=torchConfig,
    scaling_config=ScalingConfig(
        num_workers=16, use_gpu=True, resources_per_worker={"CPU": 8}
    ),
)
result = trainer.fit()
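To confirm these variables actually reach the training processes, a few lines like the following at the top of the train_loop_per_worker function can verify it (a sketch; it just echoes the env vars from inside each worker):

import os

# Sketch: verify inside each worker that the runtime_env values arrived.
for key in ("NCCL_IB_HCA", "NCCL_IB_DISABLE", "NCCL_SOCKET_IFNAME", "NCCL_DEBUG"):
    print(key, "=", os.environ.get(key))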

Here is the NCCL log output:

NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v5 symbol.
NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol.
NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
NCCL INFO P2P plugin IBext
NCCL INFO NET/IB : No device found.
NCCL INFO NCCL_IB_DISABLE set by environment to 0.
NCCL INFO NET/IB : No device found.
NCCL INFO NET/Socket : Using [0]eth0:10.237.59.102<0>
NCCL INFO Using network Socket
NCCL version 2.12.12+cuda11.7
NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14
NCCL INFO Channel 00/0 : 1[27000] -> 8[21000] [send] via NET/Socket/0
NCCL INFO Channel 01/0 : 1[27000] -> 8[21000] [send] via NET/Socket/0
NCCL INFO Channel 00 : 6[c9000] -> 5[92000] via P2P/IPC/read
NCCL INFO Channel 01 : 6[c9000] -> 5[92000] via P2P/IPC/read
NCCL INFO Connected all rings
NCCL INFO Channel 00/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
NCCL INFO Channel 01/02 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
NCCL INFO Channel 00/0 : 9[27000] -> 0[21000] [receive] via NET/Socket/0
NCCL INFO Channel 01/0 : 9[27000] -> 0[21000] [receive] via NET/Socket/0
NCCL INFO Connected all trees
NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512
NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
NCCL INFO comm 0x7f9578009010 rank 15 nranks 16 cudaDev 7 busId cf000 - Init COMPLETE
NCCL INFO Launch mode Parallel
NCCL INFO Bootstrap : Using eth0:10.237.59.102<0> [repeated 15x across cluster]
NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v5 symbol. [repeated 30x across cluster]
NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so [repeated 15x across cluster]
NCCL INFO P2P plugin IBext [repeated 15x across cluster]
NCCL INFO NET/IB : No device found. [repeated 30x across cluster]
NCCL INFO NCCL_IB_DISABLE set by environment to 0. [repeated 15x across cluster]
NCCL INFO NET/Socket : Using [0]eth0:10.237.59.101<0> [repeated 15x across cluster]
NCCL INFO Using network Socket [repeated 15x across cluster]
NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000 [repeated 15x across cluster]
NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [repeated 15x across cluster]
NCCL INFO Channel 01/0 : 8[21000] -> 0[21000] [send] via NET/Socket/0 [repeated 6x across cluster]
NCCL INFO Channel 01 : 5[92000] -> 6[c9000] via P2P/IPC/read [repeated 58x across cluster]
NCCL INFO Connected all rings [repeated 15x across cluster]
NCCL INFO Channel 01/0 : 0[21000] -> 8[21000] [receive] via NET/Socket/0 [repeated 6x across cluster]
NCCL INFO Connected all trees [repeated 15x across cluster]
NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/512 [repeated 15x across cluster]
NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer [repeated 15x across cluster]
NCCL INFO comm 0x7f7968009010 rank 3 nranks 16 cudaDev 3 busId 56000 - Init COMPLETE [repeated 15x across cluster]
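A note on the logs above: every node reports NET/IB : No device found before falling back to NET/Socket, which suggests the IB devices are not visible from the environment the Ray workers run in (for example, a container started without the RDMA devices mounted). One way to check visibility from where the workers actually execute (a sketch; assumes the ibv_devices tool from rdma-core is installed on each node):

import subprocess

import ray

ray.init(address="auto")

@ray.remote(num_gpus=1)
def check_ib_devices():
    # Sketch: list the RDMA devices visible in the worker's environment.
    # Requires rdma-core tools on the node/container.
    return subprocess.run(
        ["ibv_devices"], capture_output=True, text=True
    ).stdout

print(ray.get(check_ib_devices.remote()))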