Is Ray AIR's performance worse than Horovod's?

We observed a performance degradation of approximately 5% to 10% when using Ray AIR compared to Horovod. Is this decline expected? How can we improve Ray AIR's performance?
PS: We have integrated the DeepSpeed framework with ZeRO stage 3 and run on 64 A100 GPUs.


Just to clarify, do you mean degradation in speed or model performance?

In both cases, there theoretically shouldn't be any degradation, as Ray AIR only orchestrates the distributed runners; the inter-process communication still happens via Horovod. However, if you report a lot of results (e.g. multiple times per second) or generally call Ray APIs frequently during training, it may impact speed.
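For example (just an illustrative sketch, not your script -- report_every and the dummy loss below are made up), something like this keeps Ray calls off the per-step hot path:

from ray.air import session

def train_loop_per_worker(config):
    # Hypothetical knobs -- not from the original script.
    report_every = config.get("report_every", 100)
    num_steps = config.get("num_steps", 1000)
    for step in range(num_steps):
        loss = 1.0 / (step + 1)  # placeholder for the real training step
        # Only call back into Ray every `report_every` steps so that
        # orchestration overhead stays off the training hot path.
        if step % report_every == 0 or step == num_steps - 1:
            session.report({"step": step, "loss": loss})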

Also, if we’re talking about speed, what time frame are we talking about? Is this an hour long training or minutes? There may be some constant setup overhead that affects short runs more than long runs.

If you can provide your training script, we can try to take a look.

Thanks @kai for the elaboration. @shaojun_li, as @kai said, it would be helpful if you could share more context, especially the script.


Thank you very much for @kai's response.
Just to clarify, I meant that the training speed has decreased. When running the LLaMA 33B model on 32 A100 GPUs, there is approximately a 5% difference in speed between launching with Horovod and launching with Ray AIR.

After about ten minutes of training, both methods stabilize. Once the training jobs are running steadily, the speed difference remains around 5%.

We made modifications based on microsoft/DeepSpeedExamples, enabling DeepSpeed ZeRO stage 3.
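(For reference, enabling ZeRO stage 3 only touches the zero_optimization section of the DeepSpeed config; the values below are illustrative placeholders, not our exact settings.)

# Illustrative ZeRO stage 3 sketch -- placeholder values, not our exact config.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
    },
    "bf16": {"enabled": True},
}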

We used both Horovod and Ray AIR to start the training:
1. Horovod:

mpirun -bind-to none -map-by slot -x  NCCL_SOCKET_IFNAME=eth0 -x OMPI_MCA_btl_tcp_if_include=eth0 -x LD_LIBRARY_PATH -x PATH -x HOROVOD_STALL_CHECK_TIME_SECONDS=600 -mca routed direct -mca pml ob1 -mca btl ^openib -np 32 bash run_33b.sh

2. Ray AIR
2.1 For Ray AIR, we made changes to the main method:

import sys

import ray
from ray.runtime_env import RuntimeEnv
from ray.air.config import ScalingConfig
from ray.train.torch import TorchConfig, TorchTrainer

if __name__ == "__main__":
    args = parse_args()
    runtime_env = RuntimeEnv(
        env_vars={
            "CUDA_LAUNCH_BLOCKING": "1", "NCCL_IB_GID_INDEX": "3", "NCCL_NET_GDR_LEVEL": "2",
            "NCCL_IB_QPS_PER_CONNECTION": "4", "NCCL_IB_TC": "160", "NCCL_IB_TIMEOUT": "22",
            "NCCL_IBEXT_DISABLE": "0", "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5",
            "NCCL_IB_DISABLE": "0", "NCCL_DEBUG": "INFO",
            "OMPI_MCA_btl_tcp_if_include": "eth0", "NCCL_SOCKET_IFNAME": "eth0",
        }
    )
    torch_config = TorchConfig(backend="nccl")
    ray.init(address="auto", runtime_env=runtime_env)
    trainer = TorchTrainer(
        train_loop_per_worker=main,
        train_loop_config={"args": args, "path": sys.path},
        torch_config=torch_config,
        scaling_config=ScalingConfig(num_workers=32, use_gpu=True, resources_per_worker={"CPU": 7}),
    )
    result = trainer.fit()

2.2 The modified main method:

def main(config: dict):
    args = config.get("args")
    sys.argv = args
    sys.path = config.get("path")
    ...

Can you show more of how you run the code with Horovod? What does run_33b.sh look like?

On a high level, DeepSpeed/Torch's and Horovod's data communication implementations are different, so their performance can differ. It would be interesting to compare raw PyTorch/DeepSpeed vs. Ray AIR.

One thing that comes to mind is that some of the CPU processing may be slowed down by the way Ray sets up threading. You could try exporting OMP_NUM_THREADS=512 or similar on your machines or in your runtime env (ray.init(runtime_env={"env_vars": {"OMP_NUM_THREADS": "512"}})) to see if this speeds things up.
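Concretely, that would look roughly like this (a sketch assuming the RuntimeEnv from your launcher script above; 512 is just an example value to try, and Ray expects env_vars values as strings):

import ray
from ray.runtime_env import RuntimeEnv

# Sketch: add OMP_NUM_THREADS to the env_vars already passed to RuntimeEnv.
runtime_env = RuntimeEnv(
    env_vars={
        "OMP_NUM_THREADS": "512",  # example value to experiment with
        # ... plus the existing NCCL_* variables from your script ...
    }
)
ray.init(address="auto", runtime_env=runtime_env)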

Thank you very much for your answer.
Yesterday, I found that the performance degraded because I had added the environment variable CUDA_LAUNCH_BLOCKING=1 when starting Ray. Once I removed this setting, there was no performance difference between Ray AIR and Horovod.


Thanks for the update!