Is Ray AIR's performance worse than Horovod's?

We observed a performance degradation of approximately 5% to 10% when using Ray AIR compared to Horovod. Is this decline expected? How can we improve Ray AIR's performance?
PS: We have integrated the DeepSpeed framework with ZeRO stage 3 and run on 64 A100 GPUs.


Just to clarify, do you mean degradation in speed or model performance?

In both cases, there theoretically shouldn't be any degradation, as Ray AIR only orchestrates the distributed runners; the inter-process communication still happens via Horovod. However, if you report a lot of results (e.g. multiple times per second) or generally call Ray APIs frequently during training, it may impact speed.
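For example (just an illustrative sketch, not your script -- report_every and the dummy loss below are made up), something like this keeps Ray calls off the per-step hot path:

from ray.air import session

def train_loop_per_worker(config):
    # Hypothetical knobs -- not from the original script.
    report_every = config.get("report_every", 100)
    num_steps = config.get("num_steps", 1000)
    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder for the real training step
        # Only call back into Ray every `report_every` steps so that
        # orchestration overhead stays off the training hot path.
        if step % report_every == 0 or step == num_steps - 1:
            session.report({"step": step, "loss": loss})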

Also, if we’re talking about speed, what time frame are we talking about? Is this an hour long training or minutes? There may be some constant setup overhead that affects short runs more than long runs.

If you can provide your training script, we can try to take a look.

Thanks @kai for the elaboration. @shaojun_li, as @kai said, it would be helpful if you could share more context, especially the script.


Thank you very much for @kai's response.
Just to clarify, I meant that the training speed has decreased. When running the LLaMA 33B model on 32 A100 GPUs, there is approximately a 5% difference in speed between launching with Horovod and launching with Ray AIR.

After about ten minutes of training, both methods stabilize. Once the training jobs are running steadily, the speed difference remains around 5%.

We made modifications based on microsoft/DeepSpeedExamples, enabling DeepSpeed ZeRO stage 3.
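(For reference, enabling ZeRO stage 3 only touches the zero_optimization section of the DeepSpeed config; the values below are illustrative placeholders, not our exact settings.)

# Illustrative ZeRO stage 3 sketch -- placeholder values, not our exact config.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
    },
    "bf16": {"enabled": True},
}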

We used both Horovod and Ray AIR to start the training:
1. Horovod:

mpirun -bind-to none -map-by slot -x  NCCL_SOCKET_IFNAME=eth0 -x OMPI_MCA_btl_tcp_if_include=eth0 -x LD_LIBRARY_PATH -x PATH -x HOROVOD_STALL_CHECK_TIME_SECONDS=600 -mca routed direct -mca pml ob1 -mca btl ^openib -np 32 bash run_33b.sh

2. Ray AIR
2.1 For Ray AIR, we made changes to the main method:

import sys

import ray
from ray.runtime_env import RuntimeEnv
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

if __name__ == "__main__":
    args = parse_args()
    runtime_env = RuntimeEnv(
        env_vars={
            "CUDA_LAUNCH_BLOCKING": "1", "NCCL_IB_GID_INDEX": "3", "NCCL_NET_GDR_LEVEL": "2",
            "NCCL_IB_QPS_PER_CONNECTION": "4", "NCCL_IB_TC": "160", "NCCL_IB_TIMEOUT": "22",
            "NCCL_IBEXT_DISABLE": "0", "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5",
            "NCCL_IB_DISABLE": "0", "NCCL_DEBUG": "INFO",
            "OMPI_MCA_btl_tcp_if_include": "eth0", "NCCL_SOCKET_IFNAME": "eth0",
        }
    )
    torch_config = TorchConfig(backend="nccl")
    ray.init(address="auto", runtime_env=runtime_env)
    trainer = TorchTrainer(
        train_loop_per_worker=main,
        train_loop_config={"args": args, "path": sys.path},
        torch_config=torch_config,
        scaling_config=ScalingConfig(num_workers=32, use_gpu=True, resources_per_worker={"CPU": 7}),
    )
    result = trainer.fit()

2.2 The modified main method:

def main(config: dict):
    args = config.get("args")
    sys.argv = args
    sys.path = config.get("path")
    ...

Can you show more of how you run the code with Horovod? What does run_33b.sh look like?

On a high level, DeepSpeed/Torch's and Horovod's data communication implementations are different, so their performance can differ. It would be interesting to compare raw PyTorch/DeepSpeed vs. Ray AIR.

One thing that comes to mind is that some of the CPU processing may be slowed down by the way Ray sets up threading. You could try exporting OMP_NUM_THREADS=512 or similar on your machines or in your runtime env (ray.init(runtime_env={"env_vars": {"OMP_NUM_THREADS": "512"}})) to see if this speeds things up.
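Concretely, that would look roughly like this (a sketch assuming the RuntimeEnv from your launcher script above; 512 is just an example value to try, and Ray expects env_vars values as strings):

import ray
from ray.runtime_env import RuntimeEnv

# Sketch: add OMP_NUM_THREADS to the env_vars already passed to RuntimeEnv.
runtime_env = RuntimeEnv(
    env_vars={
        "OMP_NUM_THREADS": "512",  # example value to experiment with
        # ... plus the existing NCCL_* variables from your script ...
    }
)
ray.init(address="auto", runtime_env=runtime_env)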

Thank you very much for your answer.
Yesterday, I found that the performance degraded because I had added the environment variable CUDA_LAUNCH_BLOCKING=1 when starting Ray. Once I removed this setting, there was no performance difference between Ray AIR and Horovod.


Thanks for the update!