Hey Team,
We are noticing frequent crashes of the Ray head and Ray training worker pods in our cluster. Here are the errors for reference:
Ray head error:
`"2024-08-07 06:15:41.515 `
`2024-08-07T00:45:40.838145482Z stdout F [2024-08-07 00:45:40,838 W 48 87] (gcs_server) server_call.h:216: Wrong cluster ID token in request! Expected: 0444302cb8bbe848f5c9181c68df2aeed53b7f35e776bc0f2af640bd, but got: 18996f33050b13de10fbdc073ee1711a61f2ba3d980e6a1266897a77"`
Ray training worker error:
`2024-08-07T00:45:40.008515075Z stdout F [2024-08-07 00:45:40,008 C 69 69] (raylet) node_manager.cc:565: GCS returned an authentication error. This may happen when GCS is not backed by a DB and restarted or there is data loss in the DB. Local cluster ID: 18996f33050b13de10fbdc073ee1711a61f2ba3d980e6a1266897a77`
`2024-08-07T00:45:40.008562966Z stdout F *** StackTrace Information ***`
`2024-08-07T00:45:40.008568666Z stdout F /pyenv/versions/3.9.17/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0xb83a1a) [0x55fa69f38a1a] ray::operator<<()`
Any insight here would be helpful.
–vp
These errors are not the root cause. They happen when the Ray head restarts and generates a new cluster ID, while the workers try to reconnect assuming they are still in the old cluster. Look for errors in the Ray head logs or in your cluster that indicate why the head is restarting. It could be many things, for example out-of-memory errors or forced evictions, especially if you are running in a public cloud on spot instances, but you need to establish the root cause.
You didn’t say how often the restarts happen. If you are running very long jobs and spot evictions are indeed the cause, you may benefit from enabling GCS fault tolerance and/or configuring your cluster so that the Ray head pod is not evicted.
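If it helps, here is a rough, untested sketch of how you could check this from the Kubernetes side. It assumes the `kubernetes` Python client, KubeRay’s default `ray.io/node-type=head` pod label, and a `ray` namespace, all of which are placeholders for whatever your setup actually uses; it prints the head pod’s last termination reason and whether its node looks like spot capacity:

```python
# Rough sketch (untested): why did the Ray head restart?
# Assumes the `kubernetes` Python client, KubeRay's default
# ray.io/node-type=head label, and a placeholder "ray" namespace.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()
namespace = "ray"  # placeholder namespace

heads = v1.list_namespaced_pod(namespace, label_selector="ray.io/node-type=head")
for pod in heads.items:
    print(f"{pod.metadata.name} on node {pod.spec.node_name}, phase={pod.status.phase}")
    for cs in pod.status.container_statuses or []:
        print(f"  container={cs.name} restarts={cs.restart_count}")
        term = cs.last_state.terminated if cs.last_state else None
        if term:
            # e.g. reason=OOMKilled or Error, plus the exit code and time of the last restart
            print(f"  last termination: reason={term.reason} "
                  f"exit_code={term.exit_code} finished_at={term.finished_at}")
    if pod.spec.node_name:
        # Was the head scheduled on spot/preemptible capacity? Label names vary by cloud provider.
        labels = v1.read_node(pod.spec.node_name).metadata.labels or {}
        hints = {k: v for k, v in labels.items()
                 if "spot" in k.lower() or "preempt" in k.lower() or "capacitytype" in k.lower()}
        print(f"  capacity-related node labels: {hints}")
```

If the last termination reason comes back as OOMKilled, or the node labels show spot capacity, that usually points you straight at the root cause.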
Thanks for your reply. The failures show up when a few jobs are queued up, say 5 or more. They are all long-running jobs for sure.
Another error that we noticed on the Ray head while investigating, both yesterday during the pod crash and again today when nothing crashed at all (although we intentionally queued fewer jobs today):
`2024-08-09T07:45:48.328946896Z stdout F :actor_name:JobSupervisor`
`2024-08-09T07:45:48.530289452Z stdout F [2024-08-09 07:45:48,530 W 48 48] (gcs_server) gcs_worker_manager.cc:55: Reporting worker exit, worker id = 3f4bc19d8f25c6e9a8107323e719b01351bfd90f8b9d5c5cda1f66ea, node id = 04e00adca59173a6b987f74fb8fe2c9b0aa6df74c4d2643b86c92d32, address = 10.170.21.140, exit_type = SYSTEM_ERROR, exit_detail = Worker exits unexpectedly. Worker exits with an exit code 1. The worker may have exceeded K8s pod memory limits. The process receives a SIGTERM.. Unintentional worker failures have been reported. If there are lots of this logs, that might indicate there are unexpected failures in the cluster.`
`2024-08-09T07:45:48.530318557Z stdout F [2024-08-09 07:45:48,530 W 48 48] (gcs_server) gcs_actor_manager.cc:970: Worker 3f4bc19d8f25c6e9a8107323e719b01351bfd90f8b9d5c5cda1f66ea on node 04e00adca59173a6b987f74fb8fe2c9b0aa6df74c4d2643b86c92d32 exits, type=SYSTEM_ERROR, has creation_task_exception = 0`
So I am guessing this too is a red herring?
This indicates that a worker has crashed; a non-zero exit code means abnormal process termination. You may get some hint in the worker logs from that time, but not necessarily, depending on the reason. If you are running Ray on Kubernetes, it’s worth checking the worker pod status to see whether they are getting OOMKilled, and also checking the Kubernetes events to see if anything is happening with your nodes.
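For example, a quick sketch along these lines (again untested, assuming the `kubernetes` Python client and KubeRay’s default `ray.io/node-type=worker` label; the `ray` namespace is just a placeholder) can surface OOMKilled containers and recent Warning events such as evictions or node pressure:

```python
# Rough sketch (untested): find OOMKilled / abnormally exited Ray worker
# containers and recent Warning events. Assumes the `kubernetes` Python client,
# KubeRay's default ray.io/node-type=worker label, and a placeholder "ray" namespace.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
namespace = "ray"  # placeholder namespace

# Workers whose last container exit was an OOM kill or a non-zero exit code.
workers = v1.list_namespaced_pod(namespace, label_selector="ray.io/node-type=worker")
for pod in workers.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated if cs.last_state else None
        if term and (term.reason == "OOMKilled" or (term.exit_code or 0) != 0):
            print(f"{pod.metadata.name}/{cs.name}: reason={term.reason} "
                  f"exit_code={term.exit_code} finished_at={term.finished_at}")

# Warning events surface OOM kills, evictions, failed scheduling, node problems, etc.
for ev in v1.list_namespaced_event(namespace).items:
    if ev.type == "Warning":
        print(f"{ev.last_timestamp} {ev.involved_object.kind}/{ev.involved_object.name}: "
              f"{ev.reason} - {ev.message}")
```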