Finding worker logs on (auto)scaled-down Kubernetes nodes / using shared temp_dir

simenandresen · June 17, 2021, 8:21am

Hi,

We’re running Ray in a Kubernetes cluster, and from time to time we get the following warning:

WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307

Since we’re running Ray with autoscaling, the node where the problem happened has typically been removed (scaled down) by the time we get to inspect why the worker died, so the logs for the dead worker are gone.

Are there any general tips for persisting the logs?

One solution I thought of was to set --temp-dir to point to a shared filestore that is mounted on all the nodes (i.e. on the driver node, the head node and the worker nodes), but I’m not sure if it’s a good idea.

rliaw · June 22, 2021, 4:26pm

Yeah, pointing the --temp-dir to the shared file store should work as you expect.
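
Once the temp dir lives on the shared file store, the logs for a dead worker could be looked up by the Worker ID printed in the warning. Below is a rough sketch, not an official Ray utility: `find_worker_logs` is a hypothetical helper, and it assumes that worker log file names contain the worker ID, which you should verify against the files your Ray version actually writes.

```python
from pathlib import Path


def find_worker_logs(shared_temp_dir, worker_id):
    """Return files under shared_temp_dir whose name contains worker_id.

    Assumption: Ray worker log file names embed the worker ID.
    The search is recursive, so the session directory layout does not matter.
    """
    root = Path(shared_temp_dir)
    matches = [p for p in root.rglob(f"*{worker_id}*") if p.is_file()]
    return sorted(matches)
```

Calling it with the mount path of the shared store and the Worker ID from the warning message would return whatever matching files survived the scale-down.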

simenandresen · June 24, 2021, 7:18am

Thanks @rliaw. I did get some permission errors when specifying --temp-dir directly in the config.yaml file. I probably just need to set the right permissions on the directory or something. I’ll do some more testing.

rliaw · July 12, 2021, 7:21pm

Let me know how things went!

simenandresen · July 12, 2021, 7:26pm

It worked fine. I just had to create the directory beforehand and set the right permissions.
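
For anyone landing here later, the "create the directory beforehand and set the right permissions" step could look roughly like the sketch below. The helper name `prepare_temp_dir` and the default mode `0o770` are assumptions for illustration; the thread does not say which permissions were actually needed, so pick a mode that matches the user Ray runs as in your pods.

```python
import os
import stat


def prepare_temp_dir(path, mode=0o770):
    """Create path if it does not exist, set its permissions, return the resulting mode."""
    # exist_ok=True makes this safe to run on every node that mounts the share
    os.makedirs(path, exist_ok=True)
    # chmod explicitly, because the mode passed to makedirs is filtered by the umask
    os.chmod(path, mode)
    return stat.S_IMODE(os.stat(path).st_mode)
```

This would run once on the shared mount before Ray is started with --temp-dir pointing at the same path.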
