Root cause of "The actor is dead because its node has died"

I’m using TorchTrainer to train a PyTorch model, and it always gives me the same error in the middle of training:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayTrainWorker
The actor is dead because its node has died.

What might be the root cause, and how can I find it?
I’m looking at raylet.out and gcs_server.out, but I don’t know what I should be looking for.

Hey @Ziqi_Jiang, unfortunately this is a bit hard to diagnose. There could be a couple of problems:

  1. Did your pod hit an out-of-memory error? (This is likely if the failure happens at the same point in training every time.)
  2. Did your pod get pre-empted by someone else?

You can also post raylet.out, raylet.err, gcs_server.out, and gcs_server.err here so we can help review them.
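
If you want a first pass over those files yourself, here is a minimal sketch that greps them for lines hinting at an out-of-memory kill or a pre-emption. The keyword list and the function name scan_ray_logs are my own assumptions, not an official list of Ray log messages, so treat the hits as leads rather than a diagnosis. The path /tmp/ray/session_latest/logs is Ray's default log location; adjust it if you start Ray with a different temp directory. Logs are per node, so this would need to run on the node that died (or on a copy of its logs), which may not be possible if the pod is already gone.

```python
from pathlib import Path

# Assumed search terms (hypothetical, chosen for illustration only).
# Short terms like "oom" can produce false positives.
KEYWORDS = ("out of memory", "oom", "sigkill", "sigterm", "killed", "preempt")

# The log files mentioned in this thread.
LOG_NAMES = ("raylet.out", "raylet.err", "gcs_server.out", "gcs_server.err")


def scan_ray_logs(log_dir, keywords=KEYWORDS, log_names=LOG_NAMES):
    """Return (file name, line number, line) for each line containing a keyword."""
    hits = []
    for name in log_names:
        path = Path(log_dir) / name
        if not path.is_file():
            continue  # skip logs that do not exist on this node
        with path.open(errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                lowered = line.lower()  # case-insensitive match
                if any(k in lowered for k in keywords):
                    hits.append((name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Default Ray log directory; change it if your cluster uses another one.
    for name, lineno, line in scan_ray_logs("/tmp/ray/session_latest/logs"):
        print(f"{name}:{lineno}: {line}")
```

If this prints nothing, that does not rule out either cause: the kill may have been logged by the kernel or by Kubernetes rather than by Ray, so the pod's events and last state are worth checking as well.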