Ray Serve Pod Scheduling Failing

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am deploying models with Ray Serve and KubeRay on a Kubernetes cluster with node autoscaling enabled. Each node has 8 CPUs and 16 GB of memory, and we can scale up to at most 10 nodes. My issue is that when I configure the head with 6 CPUs, the head pod cannot be scheduled on any node. I suspect the scheduler is taking the sum of the head and worker resource limits into account, because a head with 5 CPUs and a worker with 2 CPUs works (5 + 2 = 7 CPUs fits on an 8-CPU node, whereas 6 + 5 = 11 does not). I am observing the same behavior with the memory configuration.

Is this by design (and if so, can we know why), or is it a known bug? What is the workaround if I want the head and workers to have high resource configurations?

Here is the configuration I am trying to set:

resources:
  rayHead:
    limits:
      cpu: "6"
      memory: "8Gi"
    requests:
      cpu: "6"
      memory: "8Gi"
  rayWorker: 
    limits:
      cpu: "5"
      memory: "6Gi"
    requests:
      cpu: "5"
      memory: "6Gi"

minWorkerReplica: 0
maxWorkerReplica: 5

The configuration that is currently working:

resources:
  rayHead:
    limits:
      cpu: "5"
      memory: "8Gi"
    requests:
      cpu: "5"
      memory: "8Gi"
  rayWorker: 
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "2"
      memory: "4Gi"

minWorkerReplica: 0
maxWorkerReplica: 5

Can you share the scheduling error from the head node's events so we can see what it is saying? Is it failing with just “Head node not schedulable”?

These are the scheduling events for the head pod, since the cluster first tries to spin up the head node:

Warning  FailedScheduling   18s   default-scheduler   0/14 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/14 nodes are available: 1 No preemption victims found for incoming pod, 13 Preemption is not helpful for scheduling..
Normal   NotTriggerScaleUp  8s    cluster-autoscaler  pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 1 Insufficient cpu, 1 Insufficient memory, 2 max node group size reached

Based on your pod’s scheduling events, the issue is not related to compute resources. Please check your Kubernetes pod scheduling configuration, such as nodeSelector, taints, tolerations, and so on.
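
If you are using the KubeRay RayCluster CRD, those fields live on the head group's pod template. Below is a minimal sketch only, not your actual manifest: the apiVersion may differ by KubeRay version, and the node-pool label and dedicated taint are placeholders for whatever labels and taints your node pools actually carry.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-example
spec:
  headGroupSpec:
    # Other required fields (e.g. rayStartParams) omitted for brevity.
    template:
      spec:
        # Standard Kubernetes pod scheduling fields; the label and taint
        # values below are hypothetical placeholders.
        nodeSelector:
          node-pool: ray-serve
        tolerations:
          - key: dedicated
            operator: Equal
            value: ray-serve
            effect: NoSchedule
        containers:
          - name: ray-head
            resources:
              requests:
                cpu: "6"
                memory: "8Gi"
              limits:
                cpu: "6"
                memory: "8Gi"

Comparing these values against the output of kubectl describe node <node-name> (which lists each node's labels and taints) should show why 11 of the 14 nodes fail the affinity/selector check and 2 reject the pod due to an untolerated taint.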