Serve autoscaling in EKS

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to use Ray Serve for model serving in EKS. I am using a managed node group (GPU instances) in EKS with 2 desired instances that can scale up to 8, and in the Ray config file I have set the global max workers to 8.
When the load increases I do not see new instances being added. Is there a way to configure this based on GPU usage?

Hi @parth_c, you’ll need to use Serve autoscaling in order for the cluster to scale up/down based on load. This should cause the number of Serve replicas to increase with load, triggering the cluster to scale up so the new replicas can be placed.
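For concreteness, here is a minimal sketch of a Serve deployment with autoscaling enabled. The Model class, replica bounds, and target value are placeholders, and the exact parameter names and deploy call depend on your Ray version (older releases use an experimental _autoscaling_config and Model.deploy()):

from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # each replica requests one GPU, so scale-up creates GPU demand
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,  # keep this within the cluster's max_workers
        "target_num_ongoing_requests_per_replica": 2,
    },
)
class Model:
    def __call__(self, request):
        # placeholder: run model inference here
        return "ok"

serve.run(Model.bind())  # older Ray releases use Model.deploy() instead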

Hi @eoakes,
Thank you for getting back! I am using Serve autoscaling, but the issue is that nodes are not being added to the cluster, so the pods cannot find resources. I am under the assumption that adding and removing nodes is managed by Ray under K8…

I see. Can you check the autoscaler log at /tmp/ray/session_latest/logs/monitor.* to see what it says? You should see it trying to start up nodes. If that’s the case and they’re pending, then it’s likely an issue with your EKS setup (you could also look for pending pods in the k8s cluster).

The autoscaler does try to start up nodes, and I can see pending pods in the cluster. However, the cluster is still not able to add nodes. I am using managed EKS nodes.
Here is the autoscaler log, but I don’t see any pending “Nodes” in the cluster or the K8s dashboard:

---------------------------------------------------------------
Usage:
 2.0/4.0 CPU
 2.0/2.0 GPU
 0.0/2.0 accelerator_type:T4
 0.00/14.000 GiB memory
 0.00/3.051 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
2022-03-26 10:18:34,434 INFO autoscaler.py:1154 -- StandardAutoscaler: Queue 1 new nodes for launch
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Got 1 nodes to launch.
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Launching 1 nodes, type worker_node.
2022-03-26 10:18:34,436 INFO node_provider.py:142 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2022-03-26 10:18:34,471 INFO monitor.py:362 -- :event_summary:Adding 1 nodes of type worker_node.

==> /tmp/ray/session_latest/logs/monitor.err <==
Unable to use a TTY - input is not a terminal or the right kind of file
Error from server (BadRequest): pod ray-worker-5jm7d does not have a host assigned

==> /tmp/ray/session_latest/logs/monitor.log <==
2022-03-26 10:18:39,646 INFO autoscaler.py:304 -- 
======== Autoscaler status: 2022-03-26 10:18:39.645940 ========
Node status
---------------------------------------------------------------

We rely on the underlying EKS cluster to satisfy the demand here. I would proceed with the following steps to debug (a scripted version of the same checks follows the list):

  • Check kubectl get pods to see whether there are pending pods.
  • kubectl describe the pending pod to verify it is pending because no physical node is available.
  • kubectl get nodes to see current and upcoming nodes.
  • Check your AWS EKS node group limit and autoscaling group to verify how many nodes are allowed to start up and/or are starting up.
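If it helps, the same checks can be scripted with the official kubernetes Python client (pip install kubernetes). This is only a sketch; the "ray" namespace is an assumption, so substitute the namespace your Ray pods run in:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()
namespace = "ray"          # assumption: replace with your Ray cluster's namespace

# 1. Pending pods (equivalent of `kubectl get pods`)
pending = v1.list_namespaced_pod(namespace, field_selector="status.phase=Pending")
for pod in pending.items:
    print("Pending pod:", pod.metadata.name)
    # 2. Scheduling events (equivalent of `kubectl describe pod`)
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for ev in events.items:
        print("  ", ev.reason, ev.message)

# 3. Current nodes (equivalent of `kubectl get nodes`)
for node in v1.list_node().items:
    print("Node:", node.metadata.name)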

@parth_c do you have resource requests set for your pods? I’ve been playing with Ray on EKS, and the new Karpenter (karpenter.sh) cluster autoscaler works well with it. By using custom node provisioners I can spin up GPU nodes for the Ray workloads that require them.