Ray Serve Pod Scheduling Failing

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I am deploying models with Ray Serve and KubeRay on a Kubernetes cluster with node autoscaling enabled. Each node has 8 CPUs and 16 GB of memory, and we can scale up to at most 10 nodes. My issue is that when I configure the head with 6 CPUs, the head pod cannot be scheduled on any node. I suspect the scheduler is taking the sum of the head and worker resource limits into account, because a head with 5 CPUs and a worker with 2 CPUs works (5 + 2 = 7 CPUs fits on an 8-CPU node, whereas 6 + 5 = 11 does not). I am observing the same behavior with the memory configuration.

Is this by design (and if so, can we know why), or is it a known bug? What is the workaround if I want the head and workers to have high resource configurations?

Here is the configuration I am trying to set:

resources:
  rayHead:
    limits:
      cpu: "6"
      memory: "8Gi"
    requests:
      cpu: "6"
      memory: "8Gi"
  rayWorker: 
    limits:
      cpu: "5"
      memory: "6Gi"
    requests:
      cpu: "5"
      memory: "6Gi"

minWorkerReplica: 0
maxWorkerReplica: 5

The configuration that is currently working:

resources:
  rayHead:
    limits:
      cpu: "5"
      memory: "8Gi"
    requests:
      cpu: "5"
      memory: "8Gi"
  rayWorker: 
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "2"
      memory: "4Gi"

minWorkerReplica: 0
maxWorkerReplica: 5

Can you share the scheduling error from the head node's events so we can see what it is saying? Is it failing with just “Head node not schedulable”?

These are the scheduling events for the head pod, since the cluster first tries to spin up the head node:

Warning  FailedScheduling   18s   default-scheduler   0/14 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 11 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/14 nodes are available: 1 No preemption victims found for incoming pod, 13 Preemption is not helpful for scheduling..
Normal   NotTriggerScaleUp  8s    cluster-autoscaler  pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 1 Insufficient cpu, 1 Insufficient memory, 2 max node group size reached

Based on your pod’s scheduling events, the issue is not related to compute resources. Please check your Kubernetes pod scheduling configuration, such as nodeSelector, taints, tolerations, and so on.
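
If you are using the KubeRay RayCluster CRD, those fields live on the head group's pod template. Below is a minimal sketch only, not your actual manifest: the apiVersion may differ by KubeRay version, and the node-pool label and dedicated taint are placeholders for whatever labels and taints your node pools actually carry.

apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: raycluster-example
spec:
  headGroupSpec:
    # Other required fields (e.g. rayStartParams) omitted for brevity.
    template:
      spec:
        # Standard Kubernetes pod scheduling fields; the label and taint
        # values below are hypothetical placeholders.
        nodeSelector:
          node-pool: ray-serve
        tolerations:
          - key: dedicated
            operator: Equal
            value: ray-serve
            effect: NoSchedule
        containers:
          - name: ray-head
            resources:
              requests:
                cpu: "6"
                memory: "8Gi"
              limits:
                cpu: "6"
                memory: "8Gi"

Comparing these values against the output of kubectl describe node <node-name> (which lists each node's labels and taints) should show why 11 of the 14 nodes fail the affinity/selector check and 2 reject the pod due to an untolerated taint.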