Launching distributed training from within an actor

From my understanding of the documentation on the Ray Trainer integrations (e.g. ray.train.huggingface.TransformersTrainer — Ray 2.6.1), it looks like one actor is launched per training worker process in order to conduct DDP training.

I am currently working with a setup that uses an actor as a parent to launch other tasks and maintain its own state. Would it be possible to launch distributed training from within the actor itself? Currently, distributed training does not work: it hangs during the forking stage (it looks like multiprocessing is used under the hood, and Ray actors operate in a single process). Since I can launch an arbitrary number of child actors from within a parent, why can't the trainer integration operate this way?

It would be great if anyone could give me pointers on how to go about achieving this. Thanks!

Hey @Sidharth_Baskaran, I’m not sure I fully understand your question. Could you share pseudocode for what you’re trying to achieve?

Hi @matthewdeng, I am also trying to achieve something similar. I'm working with language models and need to use DDP to fine-tune them.
My understanding was that each actor would execute the training function in a separate process with access to multiple GPUs.

I've added reproduction code here: [Tune] Multi GPU, Multi Node hyperparameter search not functioning · Issue #38505 · ray-project/ray · GitHub
I tried the suggestions there but am still struggling. Any advice would help.

Thanks!

Hi @Sidharth_Baskaran,

Yes, each worker (specified by ScalingConfig.num_workers) runs the training loop in a separate process.

You can start sub-tasks and actors from the training loop, but it requires proper resource management. You can take a look at this thread for a bit more background. Note that the way proposed in the thread can lead to resource deadlock if you’re running multiple trials (e.g. with Ray Tune).
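For illustration, here is a minimal sketch (not from the linked thread; the helper task and resource numbers are placeholders) of spawning sub-tasks from inside the training loop. The key point is that the training workers already hold the resources requested via ScalingConfig, so any child tasks must fit into whatever is still free on the cluster, otherwise they stay pending and the trial hangs:

```python
import ray


@ray.remote(num_cpus=1)
def preprocess_shard(shard_id):
    # Placeholder helper task; each call needs 1 free CPU on the cluster
    # in addition to the CPUs/GPUs already reserved by the Train workers.
    return shard_id


def train_loop_per_worker(config):
    # Spawning sub-tasks from a Train worker works, but if no spare CPUs are
    # left on the cluster these tasks never get scheduled and the worker
    # blocks in ray.get — that is the resource deadlock mentioned above.
    refs = [preprocess_shard.remote(i) for i in range(4)]
    shards = ray.get(refs)
    ...
```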

Generally, starting Ray Train from within an actor or a task should work without problems (assuming no special scheduling strategy is used).
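As a rough sketch of that pattern (assuming a recent Ray 2.x release — on Ray 2.6 the import paths differ slightly, e.g. ray.air.ScalingConfig; the actor and function names below are just placeholders):

```python
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Runs once in each of the num_workers worker processes; Ray Train sets up
    # the torch.distributed process group for DDP before calling this function.
    ...


@ray.remote(num_cpus=1)
class ParentActor:
    # A stateful parent actor (stand-in for your own actor) that kicks off
    # distributed training instead of forking processes itself.
    def run_training(self):
        trainer = TorchTrainer(
            train_loop_per_worker,
            # One worker process per GPU here; adjust to your cluster. Leave
            # enough free resources for these workers, since the parent actor
            # keeps holding its own CPU while fit() blocks.
            scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
        )
        return trainer.fit()


parent = ParentActor.remote()
result = ray.get(parent.run_training.remote())
```

The training workers are scheduled as separate actors by Ray Train, so the parent actor only needs enough resources for itself and never forks its own process — which is what avoids the hang described in the original question.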

If you share more context on what you’re trying to do, we can see if we can help you with the correct setup.

As a side note, Ray generally does not work well with multiprocessing fork.

@f2010126 I think your problem is unrelated. I’ve replied to the GitHub issue, let’s continue the discussion there.
