Finding worker logs on (auto)scaled-down Kubernetes nodes / using a shared temp_dir

Hi,

We’re running Ray in a Kubernetes cluster, and from time to time we get the following warning:

WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307

Since we’re running Ray with autoscaling, the node where the problem happened has typically been removed (scaled down) by the time we get to inspect why the worker died, so the logs for the dead worker are gone.

Are there any general tips for persisting the logs?

One solution I thought of was to set --temp-dir to point to a shared filestore that is mounted on all the nodes (i.e. on the driver node, the head node, and the worker nodes), but I’m not sure whether it’s a good idea?
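Concretely, something like the sketch below is what I had in mind (assuming the filestore is mounted at a hypothetical path like /mnt/shared-filestore/ray on every node; the exact keyword name depends on the Ray version):

```python
import ray

# Hypothetical mount point for the shared filestore; adjust to whatever
# path it is mounted on in your cluster.
SHARED_TEMP_DIR = "/mnt/shared-filestore/ray"

# On the cluster itself the temp dir is set per node via
# `ray start --temp-dir=<path>` (e.g. in the head/worker start commands of
# the autoscaler config). When starting Ray locally from a driver script,
# the equivalent is the temp dir argument of ray.init(); depending on the
# Ray version the keyword is `_temp_dir` or `temp_dir`.
ray.init(_temp_dir=SHARED_TEMP_DIR)
```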

Yeah, pointing --temp-dir at the shared filestore should work as you expect.


Thanks @rliaw. I did get some permission errors when specifying --temp-dir directly in the config.yaml file. I probably just need to set the right permissions on the directory or something. I’ll do some more testing.


Let me know how things went!

It worked fine :slight_smile:. I just had to create the directory beforehand and set the right permissions.
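For anyone who hits the same permission errors, this is roughly the kind of pre-creation step that fixed it for me (a sketch; the path is just an example for our setup, and 0o777 is deliberately permissive, so tighten it if your security model allows):

```python
import os
import stat

# Hypothetical shared filestore mount; use the same path that is passed
# to --temp-dir on all nodes.
SHARED_TEMP_DIR = "/mnt/shared-filestore/ray"

# Create the directory up front so Ray doesn't have to, and make it
# writable by every node/user that starts Ray there (0o777).
os.makedirs(SHARED_TEMP_DIR, exist_ok=True)
os.chmod(SHARED_TEMP_DIR, stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO)
```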
