How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
KubeRay documentation mentions in passing that it’s “recommended to configure your RayCluster so that only one Ray Pod fits per Kubernetes node”, but it does not provide a rationale for that. I see that similar advice has also been given on this forum by Ray team members. I am wondering what is behind such a recommendation?
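For concreteness, my reading of that recommendation is a worker group whose pods request (nearly) a whole node’s allocatable resources, so the Kubernetes scheduler can only place one Ray pod per node. A minimal sketch; the group name, image tag, and resource values are my own illustrative assumptions, sized for a hypothetical 16-vCPU node:

```yaml
# Fragment of a RayCluster manifest: one worker pod ≈ one node.
# Values are illustrative; real requests must leave headroom for the
# kubelet/system reservations on the chosen instance type.
workerGroupSpecs:
  - groupName: one-pod-per-node
    minReplicas: 0
    maxReplicas: 125
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.9.0
            resources:
              requests:
                cpu: "15"       # ~all of a 16-vCPU node's allocatable CPU
                memory: 28Gi    # similarly sized to the node's memory
              limits:
                cpu: "15"
                memory: 28Gi
```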
On the face of it, this seems counterproductive. Running several worker pods on a single node in a Kubernetes cluster with an autoscaler may help to better utilise the underlying cloud resources. To illustrate, suppose a Ray cluster is expected to scale dynamically with the workload in the range of 0-2000 CPU cores and we adhere to the above recommendation. It’s not obvious how to size the workers optimally in such a case. If I go with large workers (e.g., the AWS c7i.48xlarge EC2 instance type provides 192 vCPUs), then I’m overpaying for idle cores: if I’m running a lengthy job that only uses 10 CPUs, I’m paying nearly 20x for the capacity I actually consume. If I instead go with small workers (e.g., c7i.4xlarge with 16 vCPUs), then I’m potentially adding a lot of load on the Kubernetes control plane, and probably on the Ray head node as well: a cluster running at full capacity of 2000 cores would need 125 nodes! As a result, the Ray cluster configuration needs to be more complex, providing a selection of worker groups for different workload characteristics, and Ray cluster users need to be trained to understand these nuances.
In contrast, allowing the Kubernetes cluster autoscaler to decide what size of nodes to run makes node sizing transparent to end users. For example, Karpenter can dynamically determine optimal sizes for compute nodes based on the workload, and even rebalance pods to optimise costs. I can just go with, say, 32 CPUs per Ray worker and let Karpenter decide, based on how many workers are needed, what the optimal instance type is. It can even be configured to disrupt and consolidate worker pods onto fewer nodes to avoid fragmentation, as in the sketch below.
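To make that alternative concrete, this is the kind of Karpenter NodePool I have in mind: a minimal sketch, assuming Karpenter v1 on AWS; the pool name and the instance-family constraint are illustrative, not prescriptive:

```yaml
# Hypothetical NodePool letting Karpenter pick instance sizes for Ray
# workers and consolidate them onto fewer nodes when underutilised.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: ray-workers
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c7i"]   # let Karpenter choose the size within the family
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    # Actively repack pods onto fewer/cheaper nodes to avoid fragmentation.
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```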
That said, I’m not fully aware of the trade-offs these decisions involve from the perspective of Ray cluster performance.