How to create a global variable / lock when using the Ray trainer?

I’m trying to use Ray with torchsnapshot (https://github.com/pytorch/torchsnapshot), a performant, memory-efficient checkpointing library for PyTorch applications designed with large, complex distributed workloads in mind.
For this to work, all processes must call Snapshot.take at the same time:

TorchSnapshot supports distributed applications as first class citizens. To take a snapshot of a distributed application, simply invoke Snapshot.take() on all ranks simultaneously (similar to calling a torch.distributed API). The persisted application state will be organized as a single snapshot.
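For reference, a rough sketch of what that call looks like on each rank (the model, optimizer, and destination path are placeholders; check the torchsnapshot docs for the exact API surface):

```python
import torch
import torchsnapshot

# Placeholder model/optimizer standing in for the real training state.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# app_state maps names to stateful objects (anything exposing
# state_dict / load_state_dict).
app_state = {"model": model, "optimizer": optimizer}

# Every rank must reach this call at the same point in the program.
# torchsnapshot coordinates the ranks (much like a torch.distributed
# collective) and persists a single snapshot for the whole job.
snapshot = torchsnapshot.Snapshot.take(
    path="/tmp/snapshots/epoch_0",  # placeholder; object-store URIs also work
    app_state=app_state,
)
```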

I’m trying to set up my code so that, when an EC2 instance receives an instance termination warning, the code takes a snapshot.
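In case it helps: assuming the warning is an EC2 spot interruption notice, one way to detect it is to poll the instance metadata endpoint, which starts returning 200 once a stop/terminate action has been scheduled (roughly two minutes ahead). A minimal sketch, ignoring the IMDSv2 session tokens that some instances require:

```python
import requests

# Spot interruption notices show up at this instance-metadata path;
# it returns 404 until an interruption is actually scheduled.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def termination_imminent(timeout: float = 0.5) -> bool:
    """Return True if EC2 has issued an interruption notice for this instance."""
    try:
        resp = requests.get(SPOT_ACTION_URL, timeout=timeout)
    except requests.RequestException:
        return False
    return resp.status_code == 200
```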

For this to work, I need some sort of global flag, visible to every training worker, that can be switched on to indicate that a snapshot should be taken.

Would the correct thing to use here be a Ray named actor?

A named actor would work here. You can also create a regular (unnamed) actor and pass its handle to the workers through the train loop config.
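For example, here is a rough sketch of a named, detached actor acting as a global snapshot flag (the names `SnapshotFlag` and `snapshot_flag` are made up for illustration):

```python
import ray

@ray.remote
class SnapshotFlag:
    """Tiny actor holding a single boolean shared by all train workers."""

    def __init__(self):
        self._should_snapshot = False

    def set(self):
        self._should_snapshot = True

    def get(self) -> bool:
        return self._should_snapshot


# Created once, e.g. by the process watching for the termination warning.
# lifetime="detached" keeps the actor alive independently of its creator.
flag = SnapshotFlag.options(name="snapshot_flag", lifetime="detached").remote()

# The termination-warning handler would flip the flag with:
#   ray.get(flag.set.remote())


def train_loop_per_worker(config):
    # Each Ray Train worker looks up the same actor by its name.
    flag = ray.get_actor("snapshot_flag")
    for step in range(config.get("num_steps", 1000)):
        # ... normal training step ...

        if ray.get(flag.get.remote()):
            # All ranks should end up here so they can call
            # torchsnapshot.Snapshot.take(...) together.
            break
```

One caveat: since each worker polls the flag independently, the ranks should agree on the exact step at which they stop (for example, by all-reducing the flag value with torch.distributed) so that Snapshot.take is entered by every rank together. If you go with the non-named variant, pass the actor handle in `train_loop_config` and read it from `config` inside the loop instead of calling `ray.get_actor`.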