For heavy workloads, it is a good idea to use a large head node annotated with rayResources: { "CPU": 0 }, so that Ray does not schedule CPU-consuming tasks on the head node.
If you’re seeing workers crash, consider reducing the number of tasks that run concurrently on each worker by increasing the resource requirements annotated on the task.
For example, if a task is annotated @ray.remote(num_cpus=2), there will be at most 3 concurrent instances of that task on a CPU:6 worker node.
A bare @ray.remote is equivalent to @ray.remote(num_cpus=1), which would result in up to 6 concurrent instances of the task.
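The arithmetic behind those numbers can be sketched in a few lines of plain Python. This is only an illustration of the CPU-based bound, not part of Ray's API: the helper name max_concurrent_tasks is hypothetical, and it ignores memory, GPUs, and any custom resources that may limit concurrency further.

```python
import math


def max_concurrent_tasks(node_cpus: float, task_num_cpus: float = 1) -> int:
    """Upper bound on concurrent instances of one task on one node, by CPU only.

    Hypothetical helper for illustration; Ray does its own scheduling.
    """
    if task_num_cpus <= 0:
        # A task annotated with num_cpus=0 is not limited by CPU count,
        # so there is no finite CPU-based bound to return.
        raise ValueError("task_num_cpus must be positive")
    return math.floor(node_cpus / task_num_cpus)


# Task annotated @ray.remote(num_cpus=2) on a CPU:6 worker node
print(max_concurrent_tasks(6, 2))  # 3

# Bare @ray.remote, which defaults to num_cpus=1
print(max_concurrent_tasks(6))  # 6
```

Raising num_cpus on a task that crashes workers lowers this bound, which gives each running instance a larger share of the node.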
In general, I recommend using a few large Ray pods rather than many small ones. If possible, size each Ray pod to take up an entire Kubernetes node.