Distributed Training with Tune

Currently, I have a deep reinforcement learning framework set up where a PyTorch model is trained with DistributedDataParallel and the data comes from interacting with simulators. These simulators run inside child processes spawned by the distributed ranks.

I would like to use Tune to tune the hyperparameters, but I’m having difficulty using tune.report. Specifically, I want a child process to be able to connect to the training instance that was created in a parent process.

Is there a way to report metrics from a child process to an ancestor trainer?

Can you try using the Distributed Trainable Creator?

Here is an example: ddp_mnist_torch — Ray v2.0.0.dev0

No, this would create the workers responsible for the gradient updates, but each of these needs to create child processes that interact with the environment. These children compute the metrics that I want to send to Tune.

OK, I see. Can you instead have the parent trainer report the metrics, and use a multiprocessing Queue to pass the metrics from the child processes to the parent?
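A minimal sketch of that queue-based approach, using only the Python standard library. The names simulator_worker, drain_metrics, and train are hypothetical, the reward values are placeholders rather than real simulator output, and report is a stand-in for whatever reporting call the parent has access to (for example tune.report, assuming the installed Ray version accepts metrics as keyword arguments).

```python
import multiprocessing as mp


def simulator_worker(worker_id, num_episodes, metrics_queue):
    """Hypothetical child process: runs episodes and sends metrics to the parent."""
    for episode in range(num_episodes):
        # Placeholder value; a real worker would step the simulator here.
        episode_reward = float(worker_id * 10 + episode)
        metrics_queue.put(
            {"worker_id": worker_id, "episode": episode, "episode_reward": episode_reward}
        )
    # Sentinel telling the parent that this worker has finished.
    metrics_queue.put(None)


def drain_metrics(metrics_queue, num_workers, report):
    """Consume metrics in the parent until every worker has sent its sentinel."""
    finished = 0
    while finished < num_workers:
        item = metrics_queue.get()  # blocks until a child produces something
        if item is None:
            finished += 1
            continue
        # Only the parent calls the reporting function.
        report(**item)


def train(report, num_workers=2, num_episodes=3):
    """Hypothetical parent-side training function (one per distributed rank)."""
    metrics_queue = mp.Queue()
    workers = [
        mp.Process(target=simulator_worker, args=(i, num_episodes, metrics_queue))
        for i in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    drain_metrics(metrics_queue, num_workers, report)
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    # Stand-in for the real reporting call: just print each metrics dict.
    train(report=lambda **metrics: print(metrics))
```

The child processes never touch the Tune session; they only put plain dictionaries on the queue, so the reporting call stays in the process that owns the training instance. In a real training loop the parent would more likely poll the queue between gradient steps (for example with get_nowait) instead of blocking until all workers finish.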