Launching distributed training from within an actor

From my understanding of the documentation on the Ray Trainer integrations (e.g. ray.train.huggingface.TransformersTrainer in Ray 2.6.1), one actor is launched per process in order to conduct DDP training.

I am currently working with a setup that uses an actor as a parent to launch other tasks and maintain its own state. Would it be possible to launch distributed training from within the actor itself? Currently, distributed training hangs during the forking stage (it looks like multiprocessing is used under the hood, while a Ray actor runs in a single process). Since I can launch an arbitrary number of child actors from within a parent, why can’t the trainer integration operate this way?

Would be great if anyone can give me pointers on how to go about achieving this. Thanks!

Hey @Sidharth_Baskaran, I’m not sure I fully understand your question. Could you share pseudocode for what you’re trying to achieve?

Hi @matthewdeng, I am also trying to achieve something similar. I’m working with language models and I need to use DDP to fine-tune them.
As per my understanding, I was hoping each actor would be executing the training function as a separate process with access to multiple GPUs.

I’ve added reproduction code here: [Tune] Multi GPU, Multi Node hyperparameter search not functioning · Issue #38505 · ray-project/ray · GitHub
I tried the suggestions but am still struggling. Any advice would help.


Hi @Sidharth_Baskaran,

Yes, each worker (the number is set by ScalingConfig.num_workers) runs the training loop in a separate process.

You can start sub-tasks and actors from the training loop, but it requires proper resource management. You can take a look at this thread for a bit more background. Note that the approach proposed in that thread can lead to a resource deadlock if you’re running multiple trials (e.g. with Ray Tune).

Generally, starting Ray Train from within an actor/a function should work without problems (assuming no special scheduling strategy is used).

If you share more context on what you’re trying to do, we can see if we can help you with the correct setup.

As a side note, Ray generally does not work well with multiprocessing fork.

@f2010126 I think your problem is unrelated. I’ve replied to the GitHub issue, let’s continue the discussion there.
