I’m trying to use Ray with [torchsnapshot](https://github.com/pytorch/torchsnapshot), a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind.
For this to work, all processes must call `Snapshot.take` at the same time. From the torchsnapshot docs:

> TorchSnapshot supports distributed applications as first class citizens. To take a snapshot of a distributed application, simply invoke `Snapshot.take()` on all ranks simultaneously (similar to calling a torch.distributed API). The persisted application state will be organized as a single snapshot.
I’m trying to set up my code so that, when an EC2 instance receives an instance termination warning, the code takes a snapshot.
For this to work, I need some sort of global flag that every worker can check, and that can be flipped on/off to indicate that a snapshot should be taken.
Would the correct thing to use here be a Ray named actor?