Launching distributed training from within an actor

kai · August 30, 2023, 8:41am

Yes, each worker (the count is set by ScalingConfig.num_workers) runs the training loop in a separate process.

You can start sub-tasks and actors from the training loop, but this requires careful resource management, since they need resources on top of what the training workers already reserve. You can take a look at this thread for a bit more background. Note that the approach proposed in that thread can lead to a resource deadlock if you’re running multiple trials (e.g. with Ray Tune).
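To make the deadlock risk concrete, here is a rough back-of-the-envelope check. It is plain Python arithmetic, not a Ray API: the function names and the numbers are made up for illustration, and it assumes each trial reserves num_workers * cpus_per_worker CPUs while sub-tasks have to be scheduled on whatever the cluster has left.

```python
def free_cpus_for_subtasks(cluster_cpus: int, concurrent_trials: int,
                           num_workers: int, cpus_per_worker: int) -> int:
    """CPUs left over once every concurrent trial has reserved its workers.

    Hypothetical helper: a simplified model that ignores GPUs, custom
    resources, and any CPUs reserved for the trainer/driver itself.
    """
    reserved = concurrent_trials * num_workers * cpus_per_worker
    return cluster_cpus - reserved


def can_deadlock(cluster_cpus: int, concurrent_trials: int, num_workers: int,
                 cpus_per_worker: int, cpus_per_subtask: int = 1) -> bool:
    """True if a worker could wait forever on a sub-task that never gets a CPU."""
    free = free_cpus_for_subtasks(
        cluster_cpus, concurrent_trials, num_workers, cpus_per_worker
    )
    return free < cpus_per_subtask


# One trial on a 16-CPU cluster leaves room for sub-tasks ...
print(can_deadlock(16, concurrent_trials=1, num_workers=4, cpus_per_worker=2))
# ... but two concurrent trials reserve all 16 CPUs, so sub-tasks cannot start.
print(can_deadlock(16, concurrent_trials=2, num_workers=4, cpus_per_worker=2))
```

If the second case applies to your setup, you would need to either limit the number of concurrent trials or leave headroom in the cluster for the sub-tasks.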

Generally, starting Ray Train from within an actor or a function should work without problems (assuming no special scheduling strategy is used).
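For reference, a minimal sketch of what launching a trainer from inside an actor could look like. It assumes Ray 2.7 or newer (for ray.train.report and ray.train.ScalingConfig) with PyTorch installed; TrainingLauncher, launch, and the placeholder training loop are hypothetical names for illustration, not something from your code or from the Ray docs. Ray is imported inside the functions only so that the file can be imported on a machine without Ray.

```python
def train_loop_per_worker(config: dict) -> None:
    """Placeholder loop; runs once in each Ray Train worker process."""
    from ray import train  # lazy import, see note above

    for epoch in range(config["epochs"]):
        # A real loop would build the model and run forward/backward here.
        train.report({"epoch": epoch})


class TrainingLauncher:
    """Plain class; launch() wraps it with ray.remote() to make it an actor."""

    def run(self, num_workers: int, epochs: int):
        from ray.train import ScalingConfig
        from ray.train.torch import TorchTrainer

        trainer = TorchTrainer(
            train_loop_per_worker,
            train_loop_config={"epochs": epochs},
            scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=False),
        )
        result = trainer.fit()  # blocks inside the actor until training ends
        return result.metrics   # last reported metrics


def launch(num_workers: int = 2, epochs: int = 1):
    """Create the actor with default scheduling and run training from it."""
    import ray

    ray.init(ignore_reinit_error=True)
    launcher = ray.remote(TrainingLauncher).remote()
    return ray.get(launcher.run.remote(num_workers, epochs))
```

Calling launch() from your driver script starts the actor, which in turn starts the training workers. The cluster needs enough free CPUs for those workers in addition to whatever the actor itself requests.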

If you share more context on what you’re trying to do, we can see if we can help you with the correct setup.

As a side note, Ray generally does not work well with fork-based multiprocessing.
