When I restart and load the check point using from_checkpoint, it just hangs there, the dash board shows that 30 or so Waiting for scheduling.
Error loading checkpoint: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. If you are running on a cluster, make sure you specify an address in ray.init()
, for example, ray.init("auto")
. You can also increase the timeout by setting the TRAIN_PLACEMENT_GROUP_TIMEOUT_S environment variable.
Note that there is no lack of resources.