Tasks are completed but ray.exceptions.WorkerCrashedError

Dmitri · April 28, 2022, 2:48am

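For heavy workloads, it's a good idea to use a large head node annotated with rayResources: { "CPU": 0 }. With zero CPUs advertised, Ray won't schedule CPU-requiring tasks on the head node, which leaves it free for cluster management.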

If you’re seeing workers crash, try reducing the number of tasks that run concurrently on each worker by increasing the resource annotations on the task.
For example, if a task is annotated @ray.remote(num_cpus=2), there will be at most 3 concurrent instances of that task on a worker node with 6 CPUs (CPU: 6).
A bare @ray.remote is equivalent to @ray.remote(num_cpus=1), which would allow up to 6 concurrent instances of the task on the same node.
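
To make the arithmetic above concrete, here is a minimal sketch. The helper names max_concurrent_tasks and make_heavy_task are made up for illustration and are not part of Ray; the only Ray API used is the @ray.remote(num_cpus=...) decorator mentioned above. It assumes CPU is the only limiting resource on the node.

```python
import math


def max_concurrent_tasks(node_cpus: float, num_cpus_per_task: float) -> int:
    """Upper bound on concurrent instances of one task on a single node.

    Hypothetical helper (not a Ray API). Only CPU is considered; memory
    or custom resources could lower the real limit further.
    """
    if num_cpus_per_task <= 0:
        raise ValueError("num_cpus_per_task must be positive for this estimate")
    return math.floor(node_cpus / num_cpus_per_task)


def make_heavy_task():
    """Return a Ray remote function that reserves 2 CPUs per invocation.

    Ray is imported lazily so the arithmetic above can be used without
    Ray installed. The task body is a placeholder.
    """
    import ray

    @ray.remote(num_cpus=2)
    def heavy_task(x):
        return x * 2

    return heavy_task


# The two cases from the post, on a worker node with 6 CPUs:
print(max_concurrent_tasks(6, 2))  # @ray.remote(num_cpus=2)
print(max_concurrent_tasks(6, 1))  # bare @ray.remote
```

Raising num_cpus doesn't make a task use more cores; it only tells the scheduler to reserve more, so fewer instances get packed onto each worker.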

In general, I recommend using a few large Ray pods rather than many small ones. If possible, size the Ray pods to take up entire Kubernetes nodes.
