How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I am trying to use Ray Serve for model serving on EKS. I am using a managed node group (GPU instances) in EKS with 2 desired instances that can scale up to 8, and in the Ray config file I have set the global max_workers to 8.
When the load increases I do not see new instances being added. Is there a way to configure scaling based on GPU usage?
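For context, the relevant part of my cluster config is roughly along these lines (a simplified sketch, not my exact file; it follows the legacy Ray-on-Kubernetes autoscaler config format, and the worker_node type name matches the autoscaler log further down — everything else is illustrative):

```yaml
# Rough sketch of the kind of cluster config described above (illustrative values).
# The key pieces are the global max_workers cap and the GPU request on the worker pod.
cluster_name: ray-serve-gpu          # hypothetical name
max_workers: 8                       # global cap mentioned in the question
provider:
  type: kubernetes
  namespace: ray                     # hypothetical namespace
available_node_types:
  worker_node:                       # type name taken from the autoscaler log below
    min_workers: 0
    max_workers: 8
    node_config:
      apiVersion: v1
      kind: Pod
      metadata:
        generateName: ray-worker-
      spec:
        containers:
          - name: ray-node
            image: rayproject/ray-ml:latest-gpu   # example image
            resources:
              requests:
                cpu: 2
                memory: 7Gi
                nvidia.com/gpu: 1    # GPU request the scheduler / node autoscaler can see
              limits:
                nvidia.com/gpu: 1
```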
Hi @parth_c, you’ll need to use Serve autoscaling for the cluster to scale up/down based on load. This should cause the number of Serve replicas to increase with load, triggering the cluster to scale up so that the new replicas can be placed.
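Roughly, a deployment along these lines is what I mean (a sketch, not a drop-in snippet — the exact parameter names depend on your Ray version; older releases use `_autoscaling_config`, and `serve.run`/`.bind()` assume the Ray 2.x API):

```python
# Sketch of a Serve deployment with autoscaling and a per-replica GPU request.
# Each pending replica shows up to the cluster autoscaler as a
# {'CPU': 1.0, 'GPU': 1.0} demand, which is what triggers new worker nodes.
from ray import serve


@serve.deployment(
    ray_actor_options={"num_cpus": 1, "num_gpus": 1},  # one GPU per replica
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,                              # one replica per GPU worker, as an example
        "target_num_ongoing_requests_per_replica": 5,   # add replicas as load grows
    },
)
class Model:
    def __call__(self, request):
        # run GPU inference here
        return "ok"


serve.run(Model.bind())
```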
Hi @eoakes,
Thank you for getting back! I am using Serve autoscaling, but the issue is that nodes are not being added to the cluster, so the pods cannot find resources. I am under the assumption that adding and removing nodes is managed by Ray under K8…
I see. Can you check the autoscaler log at /tmp/ray/session_latest/logs/monitor.* to see what it says? You should see it trying to start up nodes. If that’s the case and they’re pending, then it’s likely an issue with your EKS setup (you could also look for pending pods in the k8s cluster).
The autoscaler does try to start up nodes. I can see pending pods in the cluster. However, the cluster is still not able to add nodes. I am using managed EKS nodes.
Here is the autoscaler log, but I don’t see any pending “Nodes” in the cluster or in the K8s dashboard:
---------------------------------------------------------------
Usage:
2.0/4.0 CPU
2.0/2.0 GPU
0.0/2.0 accelerator_type:T4
0.00/14.000 GiB memory
0.00/3.051 GiB object_store_memory
Demands:
{'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
2022-03-26 10:18:34,434 INFO autoscaler.py:1154 -- StandardAutoscaler: Queue 1 new nodes for launch
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Got 1 nodes to launch.
2022-03-26 10:18:34,435 INFO node_launcher.py:110 -- NodeLauncher0: Launching 1 nodes, type worker_node.
2022-03-26 10:18:34,436 INFO node_provider.py:142 -- KubernetesNodeProvider: calling create_namespaced_pod (count=1).
2022-03-26 10:18:34,471 INFO monitor.py:362 -- :event_summary:Adding 1 nodes of type worker_node.
==> /tmp/ray/session_latest/logs/monitor.err <==
Unable to use a TTY - input is not a terminal or the right kind of file
Error from server (BadRequest): pod ray-worker-5jm7d does not have a host assigned
==> /tmp/ray/session_latest/logs/monitor.log <==
2022-03-26 10:18:39,646 INFO autoscaler.py:304 --
======== Autoscaler status: 2022-03-26 10:18:39.645940 ========
Node status
---------------------------------------------------------------
@parth_c do you have resource requests set for your pods? I’ve been playing with Ray on EKS, and the new Karpenter cluster autoscaler (http://karpenter.sh) works well with it. By using custom node provisioners I can spin up GPU nodes for the Ray workloads that require them.
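By “custom node provisioners” I mean something roughly like this (a sketch against Karpenter’s v1alpha5 Provisioner API; the instance types, GPU limit, and discovery tags are illustrative examples, not values from this thread):

```yaml
# Illustrative Karpenter Provisioner that can add GPU nodes when pods with
# nvidia.com/gpu requests are pending (v1alpha5 schema; all values are examples).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.xlarge", "g4dn.2xlarge"]   # example T4 GPU instance types
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      nvidia.com/gpu: "8"                       # cap total GPUs this provisioner adds
  ttlSecondsAfterEmpty: 30                      # scale empty GPU nodes back down
  provider:
    subnetSelector:
      karpenter.sh/discovery: my-eks-cluster    # hypothetical discovery tag
    securityGroupSelector:
      karpenter.sh/discovery: my-eks-cluster
```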
Were you able to get it working? I am facing the same issue: the Ray cluster autoscales based on load, but when it reaches the resource threshold at the node level, it is unable to scale up a new node in the EKS cluster and schedule the new worker pods on it.