I’m using TorchTrainer to train a PyTorch model, and it always fails with the same error partway through training:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayTrainWorker
The actor is dead because its node has died.
What might be the root cause, and how can I find it?
I’ve been looking at raylet.out and gcs_server.out, but I don’t know what I should be looking for.
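
For reference, here is a minimal sketch of how those two logs could be searched for lines that might relate to the node dying. The keyword list is only my guess at what could be relevant (memory pressure, killed processes, missed health checks), not an official set of Ray log messages. The `scan_logs` helper is hypothetical, and the path assumes Ray's default log directory, /tmp/ray/session_latest/logs.

```python
import re
from pathlib import Path

# Hypothetical keyword list: guesses at what a node-death trail might contain,
# not an official set of Ray log messages.
PATTERNS = re.compile(
    r"out of memory|oom|killed|dead|heartbeat|health check", re.IGNORECASE
)


def scan_logs(log_dir, files=("raylet.out", "gcs_server.out")):
    """Return (file name, line number, text) for every line matching PATTERNS."""
    hits = []
    for name in files:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that do not exist on this node
        with path.open(errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if PATTERNS.search(line):
                    hits.append((name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Assumed default Ray log location; adjust if the cluster uses another one.
    for name, lineno, text in scan_logs("/tmp/ray/session_latest/logs"):
        print(f"{name}:{lineno}: {text}")
```

This would need to run on each node (the head node for gcs_server.out), since every node keeps its own log directory. The matches are only a starting point for reading the surrounding lines.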