I created a cluster from two physical servers: a Docker container named “ray-head” runs on one server, and a Docker container named “ray-worker” runs on the other. I can run a distributed, parallel Ray Tune program, but an error occurs when I try to save a checkpoint. The details are as follows:
ray.tune.error.TuneError: not found after successful sync down. Are you running on a k8s or managed cluster? rsync will not function due to a lack of SSH functionality. You’ll need to use cloud-checkpointing if that’s the case.
But I have no cloud storage; I only have a cluster of two physical servers.
I have spent a whole day on the problem. Please help me out, thanks in advance.