What is the rationale for recommending one worker per k8s node

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

KubeRay documentation mentions in passing that it’s “recommended to configure your RayCluster so that only one Ray Pod fits per Kubernetes node”, but it does not provide a rationale for that. I see that similar advice has also been given on this forum by Ray team members. I am wondering what is behind such a recommendation?

On the face of it, it is counterproductive. Running several worker pods on a single node in a Kubernetes cluster with an autoscaler may help to better utilise the underlying cloud resources. To illustrate, let’s suppose a Ray cluster is expected to scale dynamically based on workload in the range of 0-2000 CPU cores and we adhere to the above recommendation. It’s not obvious how to optimally size the workers in such a case. If I go with larger workers (e.g., the AWS c7i.48xlarge EC2 instance type provides 192 vCPUs), then I’m overpaying for idle cores: if I’m running a lengthy job that only uses 10 CPUs, I’m paying for nearly 20x the capacity I actually consume. On the other hand, if I decide to go with smaller workers (e.g., c7i.4xlarge with 16 vCPUs), then I’m potentially adding a lot of load on the control plane of the Kubernetes cluster, and probably on the Ray head node as well: a cluster running at full capacity of 2000 cores will need 125 nodes! As a result, the Ray cluster configuration needs to be more complex, providing a selection of worker groups depending on workload characteristics, and Ray cluster users need to be trained to understand these nuances.
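To make the trade-off concrete, here is a rough back-of-the-envelope sketch of the two sizing options above. The instance specs and the 2000-core target come from my example; everything else is illustrative and ignores the head node and per-node system overhead.

```python
# Rough fragmentation arithmetic for the two worker sizes discussed above.
# Illustrative only: assumes one Ray worker pod per node.
TARGET_CORES = 2000

options = {
    "c7i.48xlarge": 192,  # large workers: few nodes, but idle cores for small jobs
    "c7i.4xlarge": 16,    # small workers: many nodes, more control-plane load
}

for instance, vcpus in options.items():
    nodes_at_full_capacity = -(-TARGET_CORES // vcpus)  # ceiling division
    # Worst case: a long-running job that needs only 10 CPUs still
    # occupies (and pays for) a whole worker node.
    idle_fraction = 1 - 10 / vcpus
    print(f"{instance}: {nodes_at_full_capacity} nodes at full capacity, "
          f"{idle_fraction:.0%} of one node idle for a 10-CPU job")
```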

In contrast, allowing the Kubernetes cluster autoscaler to decide what size of nodes to run makes node sizing decisions transparent to end users. For example, Karpenter can dynamically determine optimal sizes for compute nodes based on the workload, and even rebalance pods to optimise costs. I can just go with, say, 32 CPUs per Ray worker, and let Karpenter decide, depending on how many workers are needed, what the optimal instance type is. It can even be configured to disrupt and consolidate worker pods onto fewer nodes to avoid fragmentation.

That said, I’m not fully aware of the trade-off involved in these decisions from the perspective of Ray cluster performance.


This recommendation makes sense for at least three reasons: one technical reason related to the Ray object store, and two practical reasons related to how a Ray application uses hardware resources. If your application is compute bound, the Ray runtime of a single pod can already utilize all resources of a given node, and can even oversubscribe it when using fractional resources. If the application is somewhat IO bound, a single Ray pod can saturate the network or I/O bandwidth of the instance. In either case, running multiple Ray pods per node does not really help. The technical reason is that the object store can use zero-copy reads among the Ray processes of a pod, but with multiple pods, even if they are collocated on the same node, objects have to be transferred over the network instead of being shared through the object store's shared memory.
With respect to the instance sizing problem, it is really workload dependent, and as you have observed, users need to understand these nuances to choose a good instance size.
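To illustrate the zero-copy point, here is a minimal sketch (assuming a running Ray cluster; the array size and the `consume` function are just illustrative). Worker processes in the same Ray pod read the object via shared memory, whereas a worker in a different pod, even one scheduled on the same Kubernetes node, first has to fetch the object over the network into its own pod's object store.

```python
import numpy as np
import ray

ray.init()  # assumes a Ray cluster is already available

# Store a large array once in the local object store (~400 MB).
big_array = np.zeros((10_000, 10_000), dtype=np.float32)
ref = ray.put(big_array)

@ray.remote
def consume(arr):
    # Ray dereferences the ObjectRef before calling the task.
    return float(arr.sum())

# Tasks scheduled in the same Ray pod map the object via shared memory
# (zero-copy read); tasks in another pod must copy it over the network.
results = ray.get([consume.remote(ref) for _ in range(4)])
```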


Thank you @alan

I see your point about possible network saturation and object store locality. I’d say that if a Ray job is IO bound, it isn’t using distributed compute effectively, but I appreciate that sometimes you have no choice and have to keep the cluster running despite poorly written algorithms.

Regarding the use of CPUs, if the worker pods are given hard CPU limits, how could they oversubscribe the node?

Hello @lobanov, the Ray runtime can’t oversubscribe a node if it is limited by the k8s cluster. I was talking about the hypothetical situation of only using one Ray pod per node. With fractional resources, for instance num_cpus=0.5 for all tasks, Ray would create twice as many worker processes as there are cores available. Again, this can be useful if the tasks spend their time waiting on IO, API requests, or an RL environment.
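A minimal sketch of what I mean by fractional resources (the `fetch` function and the sleep are just stand-ins for real IO-bound work):

```python
import time
import ray

ray.init()

# With num_cpus=0.5 the scheduler runs two of these tasks per logical CPU,
# which helps when tasks mostly wait on IO, API requests, or an RL env
# rather than burning CPU.
@ray.remote(num_cpus=0.5)
def fetch(item):
    time.sleep(1)  # stand-in for a blocking IO/API call
    return item

# On a 16-CPU worker pod, up to 32 of these tasks run concurrently.
results = ray.get([fetch.remote(i) for i in range(64)])
```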