Serve autoscaling in EKS

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to use Ray Serve for model serving in EKS. I am using a managed node group (GPU instances) in EKS with 2 desired instances that can scale up to 8, and in the Ray config file I have set the global max workers to 8.
When the load increases I do not see new instances being added. Is there a way to configure this based on GPU usage?

Hi @parth_c, you’ll need to use Serve autoscaling in order for the cluster to scale up/down based on load. This should cause the number of Serve replicas to increase with load, triggering the cluster to scale up so the new replicas can be placed.
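For concreteness, here is a minimal sketch of a Serve deployment with autoscaling enabled. The Model class, replica bounds, and target value are placeholders, and the exact parameter names and deploy call depend on your Ray version (older releases use an experimental _autoscaling_config and Model.deploy()):

from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},  # each replica requests one GPU, so scale-up creates GPU demand
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,  # keep this within the cluster's max_workers
        "target_num_ongoing_requests_per_replica": 2,
    },
)
class Model:
    def __call__(self, request):
        # placeholder: run model inference here
        return "ok"

serve.run(Model.bind())  # older Ray releases use Model.deploy() instead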

Hi @eoakes,
Thank you for getting back! I am using Serve autoscaling, but the issue is that nodes are not being added to the cluster, so the pods cannot find resources. I am under the assumption that adding and removing nodes is managed by Ray under K8…

I see. Can you check the autoscaler log at /tmp/ray/session_latest/logs/monitor.* to see what it says? You should see it trying to start up nodes. If that’s the case and they’re pending, then it’s likely an issue with your EKS setup (you could also look for pending pods in the k8s cluster).

The autoscaler does try to start up nodes, and I can see pending pods in the cluster. However, the cluster is still not able to add nodes. I am using managed EKS nodes.
Here is the autoscaler log, but I don’t see any pending “Nodes” in the cluster or the K8s dashboard:

---------------------------------------------------------------
Usage:
 2.0/4.0 CPU
 2.0/2.0 GPU
 0.0/2.0 accelerator_type:T4
 0.00/14.000 GiB memory
 0.00/3.051 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
2022-03-26 10:18:34,434 INFO autoscaler.py:1154 -- StandardAutoscaler: Queue 1 new nodes for launch
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Got 1 nodes to launch.
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Launching 1 nodes, type worker_node.
2022-03-26 10:18:34,436 INFO node_provider.py:142 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2022-03-26 10:18:34,471 INFO monitor.py:362 -- :event_summary:Adding 1 nodes of type worker_node.

==> /tmp/ray/session_latest/logs/monitor.err <==
Unable to use a TTY - input is not a terminal or the right kind of file
Error from server (BadRequest): pod ray-worker-5jm7d does not have a host assigned

==> /tmp/ray/session_latest/logs/monitor.log <==
2022-03-26 10:18:39,646 INFO autoscaler.py:304 -- 
======== Autoscaler status: 2022-03-26 10:18:39.645940 ========
Node status
---------------------------------------------------------------

We rely on the underlying EKS cluster to satisfy the demand here. I would proceed with the following steps to debug (a scripted version of the same checks follows the list):

  • Check kubectl get pods to see whether there are pending pods.
  • kubectl describe the pending pod to verify it is pending because no physical node is available.
  • kubectl get nodes to see current and upcoming nodes.
  • Check your AWS EKS node group limit and autoscaling group to verify how many nodes are allowed to start up and/or are starting up.
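If it helps, the same checks can be scripted with the official kubernetes Python client (pip install kubernetes). This is only a sketch; the "ray" namespace is an assumption, so substitute the namespace your Ray pods run in:

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
v1 = client.CoreV1Api()
namespace = "ray"          # assumption: replace with your Ray cluster's namespace

# 1. Pending pods (equivalent of `kubectl get pods`)
pending = v1.list_namespaced_pod(namespace, field_selector="status.phase=Pending")
for pod in pending.items:
    print("Pending pod:", pod.metadata.name)
    # 2. Scheduling events (equivalent of `kubectl describe pod`)
    events = v1.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for ev in events.items:
        print("  ", ev.reason, ev.message)

# 3. Current nodes (equivalent of `kubectl get nodes`)
for node in v1.list_node().items:
    print("Node:", node.metadata.name)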

@parth_c do you have resource requests set for your pods? I’ve been playing with Ray on EKS, and the new Karpenter (karpenter.sh) cluster autoscaler works well with it. By using custom node provisioners I can spin up GPU nodes for the Ray workloads that require them.