Distributed Training with tune

Matthew · May 21, 2021, 9:31pm

Currently, I have a deep reinforcement learning framework set up where a PyTorch model is trained with DistributedDataParallel and the data comes from interacting with a simulator. The simulators run inside child processes spawned by the distributed ranks.

I would like to use Tune to tune the hyperparameters, but I’m having difficulty using tune.report. Specifically, I want a child process to be able to connect to the training instance that was created in a parent process.

Is there a way to report metrics from a child process to an ancestor trainer?

rliaw · May 21, 2021, 9:42pm

Can you try using the Distributed Trainable Creator?

Here is an example: https://docs.ray.io/en/master/tune/examples/ddp_mnist_torch.html

Matthew · May 21, 2021, 9:50pm

No, this would create the workers responsible for the gradient updates, but each of these needs to create child processes that interact with the environment. These children compute the metrics that I want to send to Tune.

rliaw · May 21, 2021, 10:39pm

OK, I see. Can you instead have the parent trainer report the metrics, and use a multiprocessing Queue to pass metrics from the child processes to the parent?
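
A minimal sketch of that queue-based pattern, using only the standard library so it runs without Ray. The names `simulator_worker` and `train_with_child_metrics`, the stubbed reward, and the `report` callback are all hypothetical; inside a Tune trainable you would presumably pass `tune.report` (which takes metrics as keyword arguments) as `report`. The start method is fixed to "fork" to keep the example short, which assumes Linux or macOS; if CUDA is already initialized in the parent, "spawn" is generally required and the entry point then has to sit under an `if __name__ == "__main__":` guard.

```python
import multiprocessing as mp


def simulator_worker(metric_queue, worker_id, num_episodes):
    """Child process: run episodes and push metrics to the parent."""
    for episode in range(num_episodes):
        # Placeholder for a real simulator rollout (hypothetical reward).
        episode_reward = float(worker_id * 10 + episode)
        metric_queue.put(
            {
                "worker_id": worker_id,
                "episode": episode,
                "episode_reward": episode_reward,
            }
        )
    # Sentinel telling the parent that this worker has finished.
    metric_queue.put(None)


def train_with_child_metrics(report, num_workers=2, num_episodes=3):
    """Parent trainer: spawn children, drain the queue, report metrics."""
    # "fork" is an assumption made for brevity; see the note above on "spawn".
    ctx = mp.get_context("fork")
    metric_queue = ctx.Queue()
    workers = [
        ctx.Process(
            target=simulator_worker,
            args=(metric_queue, worker_id, num_episodes),
        )
        for worker_id in range(num_workers)
    ]
    for worker in workers:
        worker.start()

    # Only the parent calls report(), so the children never need a
    # handle on the Tune session.
    finished = 0
    while finished < num_workers:
        metrics = metric_queue.get()
        if metrics is None:
            finished += 1
            continue
        report(**metrics)

    # Join after draining the queue so a child never blocks on a full pipe.
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    # Stand-in for tune.report: print whatever the children produced.
    train_with_child_metrics(lambda **metrics: print(metrics))
```

In a real trainer the drain loop would be interleaved with the gradient updates, for example by polling with `metric_queue.get_nowait()` once per training step instead of blocking until the children exit.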
