I’m trying to use Ray with [torchsnapshot](https://github.com/pytorch/torchsnapshot), a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.
For this to work, all processes must call `Snapshot.take` at the same time. From the torchsnapshot docs:

> TorchSnapshot supports distributed applications as first class citizens. To take a snapshot of a distributed application, simply invoke `Snapshot.take()` on all ranks simultaneously (similar to calling a torch.distributed API). The persisted application state will be organized as a single snapshot.
I’m trying to set up my code so that, when an EC2 instance receives an instance termination warning, the code takes a snapshot.
For this to work, I need some sort of global flag that every worker can check, and that can be flipped on/off to indicate that a snapshot should be taken.
Would the correct thing to use here be a Ray named actor?