Tasks complete, but ray.exceptions.WorkerCrashedError is raised

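For heavy workloads, a large head node annotated with rayResources: { "CPU": 0 } is a good idea. With zero CPUs advertised, Ray does not schedule CPU-requiring tasks on the head node, which leaves its capacity for managing the cluster.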

If you’re seeing workers crash, try reducing the number of tasks that run concurrently on each worker by increasing the resource annotations on the task.
For example, if a task is annotated @ray.remote(num_cpus=2), at most 3 instances of that task will run concurrently on a CPU:6 worker node.
A bare @ray.remote is equivalent to @ray.remote(num_cpus=1), which would allow up to 6 concurrent instances of the task on the same node.
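
The arithmetic above can be sketched in plain Python. This is only a rough illustration: max_concurrent_tasks is a hypothetical helper, not part of Ray, and it assumes CPU is the only limiting resource (memory, GPU, or custom resource annotations can lower the real number further).

```python
import math


def max_concurrent_tasks(node_cpus: float, task_num_cpus: float = 1) -> int:
    """Upper bound on concurrent instances of one task on one node.

    Hypothetical helper (not a Ray API). Considers the CPU annotation only.
    """
    if task_num_cpus <= 0:
        # A zero-CPU task is not limited by CPU at all, so no bound applies.
        raise ValueError("task_num_cpus must be positive for a CPU-based bound")
    return math.floor(node_cpus / task_num_cpus)


# @ray.remote(num_cpus=2) on a CPU:6 worker node
print(max_concurrent_tasks(6, 2))  # 3

# bare @ray.remote, i.e. num_cpus=1, on the same node
print(max_concurrent_tasks(6))     # 6
```

Raising num_cpus on the task is the knob here: the larger the annotation, the fewer instances fit on a node at once.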

In general, I recommend using a few large Ray pods rather than many small ones. If possible, size each Ray pod to take up an entire Kubernetes node.