Launching distributed training from within an actor

Hi @Sidharth_Baskaran,

yes, each worker (specified by ScalingConfig.num_workers) runs the training loop in a separate process.

You can start sub-tasks and actors from the training loop, but it requires proper resource management. You can take a look at this thread for a bit more background. Note that the way proposed in the thread can lead to resource deadlock if you’re running multiple trials (e.g. with Ray Tune).

Generally, starting Ray Train from within an actor/a function should work without problems (assuming no special scheduling strategy is used).

If you share more context on what you’re trying to do, we can see if we can help you with the correct setup.

As a side node, Ray generally does not work well with multiprocessing fork.