PyTorch DistributedTrainable Tune Report on Rank 0 Only

We are running deep reinforcement learning, so our training data comes from simulations running in child processes spawned by each rank. We also have custom distributed communication points for saving learning curves and observations from the simulations. My goal is to wrap Ray Tune around this setup with as few changes as possible.
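Roughly, the reporting pattern I have in mind is sketched below. This is only an illustration of the intent: `train_func`, the iteration loop, and the metric names are placeholders for our existing RL step, and `torch.distributed` is assumed to already be initialized by the distributed trainable wrapper before the function runs.

```python
import torch.distributed as dist
from ray import tune


def train_func(config, checkpoint_dir=None):
    # Assumes the distributed trainable wrapper has already set up the
    # torch.distributed process group for this worker.
    rank = dist.get_rank()

    for it in range(config.get("num_iterations", 100)):
        # Placeholder for our existing step: roll out simulations in the
        # child processes, gather results, and update the policy.
        metrics = {"mean_reward": 0.0, "iteration": it}  # hypothetical metrics

        # Gate reporting so that only rank 0 sends results to Tune and the
        # learning curve is not duplicated once per worker.
        if rank == 0:
            tune.report(**metrics)
```

In the Ray 1.x API I would then wrap `train_func` with `DistributedTrainableCreator` and pass the result to `tune.run`; whether gating `tune.report` on rank 0 like this is actually safe with that wrapper is part of what I am asking.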

Everything is working except the checkpointing; see Checkpoint Discussion below.