We’re running Ray in a Kubernetes cluster, and from time to time we get the following warning:
WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307
Since we’re running Ray with autoscaling, the node where the problem happened has typically been removed (scaled down) by the time we get to inspect why the worker died, so the logs for the dead worker are gone.
Are there any general tips for persisting the logs?
One solution I thought of was to set --temp-dir to point to a shared filestore that is mounted on all the nodes (i.e. mounted on the driver node, the head node, and the worker nodes), but I’m not sure if it’s a good idea?
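For reference, a minimal sketch of what I had in mind (the mount path /mnt/ray-shared is a placeholder for whatever the shared filestore is mounted as; Ray writes logs under {temp_dir}/session_*/logs):

```shell
# On the head node (assumed shared mount at /mnt/ray-shared):
ray start --head --temp-dir=/mnt/ray-shared

# On each worker node, pointing at the same shared mount:
ray start --address=<head-node-ip>:6379 --temp-dir=/mnt/ray-shared
```

My worry is whether having every node write its session files to the same shared directory causes contention or collisions, or whether per-node subdirectories would be needed.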