How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am using Ray Serve to run inference on an EKS cluster, deploying Ray workers on AWS Neuron (Inferentia2) nodes. I start with the same replicas and minReplicas in the RayService YAML:

```yaml
workerGroupSpecs:
- groupName: inf2-worker-group
  replicas: 2
  minReplicas: 2
  maxReplicas: 8
```
I have also configured the Ray Serve replicas under serveConfigV2. I am able to run two Inferentia2 worker nodes, each of which has access to the underlying Neuron accelerators. However, when I try to increase min_replicas, I don't see any pending RayWorker pods. In the head node logs, I see the following messages:
" 197512024-03-15 10:19:54,242 INFO autoscaler.py:469 – The autoscaler took 0.049 seconds to complete the update iteration.
197502024-03-15 10:19:54,242 WARNING resource_demand_scheduler.py:782 – The autoscaler could not find a node type to satisfy the request: [{‘CPU’: 10.0, ‘neuron_cores’: 2.0}, {‘CPU’: 10.0, ‘neuron_cores’: 2.0}, {‘CPU’: 10.0, ‘neuron_cores’: 2.0}]. Please specify a node type with the necessary resources.
19749 {‘CPU’: 10.0, ‘neuron_cores’: 2.0}: 3+ pending tasks/actors"
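For context, my understanding is that each Serve replica is a Ray actor that requests exactly the resources given in ray_actor_options, which is where the repeated {'CPU': 10.0, 'neuron_cores': 2.0} demands above come from. Here is a minimal Python paraphrase of the deployment options in my config (a sketch, not my actual entrypoint code; the class name is made up):

```python
from ray import serve

# Each replica of this deployment is a Ray actor that asks the Ray
# scheduler for 10 CPUs and 2 units of the custom "neuron_cores"
# resource -- matching the pending demands in the autoscaler log.
@serve.deployment(
    ray_actor_options={"num_cpus": 10, "resources": {"neuron_cores": 2}},
    autoscaling_config={"min_replicas": 15, "max_replicas": 20},
)
class StableDiffusionV2:
    ...
```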
Here is a snippet of my RayService config YAML:
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: stablediffusion-service
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: stable-diffusion-deployment
        import_path: "ray_serve_stablediffusion:entrypoint"
        route_prefix: "/"
        runtime_env:
          env_vars:
            MODEL_ID: "aws-neuron/stable-diffusion-xl-base-1-0-1024x1024"
            NEURON_CC_FLAGS: "-O1"
        deployments:
          - name: stable-diffusion-v2
            autoscaling_config:
              metrics_interval_s: 0.2
              min_replicas: 15
              max_replicas: 20
              look_back_period_s: 2
              downscale_delay_s: 30
              upscale_delay_s: 2
              target_num_ongoing_requests_per_replica: 1
            graceful_shutdown_timeout_s: 5
            max_concurrent_queries: 100
            ray_actor_options:
              num_cpus: 10
              resources: {"neuron_cores": 2}
  rayClusterConfig:
    rayVersion: '2.9.0'
    enableInTreeAutoscaling: true
    headGroupSpec:
      serviceType: NodePort
      headService:
        metadata:
          name: stablediffusion-service
          namespace: stablediffusion
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        spec:
          containers:
            - name: ray-head
              image:
              imagePullPolicy: Always # Ensure the image is always pulled when updated
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "ray stop"]
              resources:
                limits:
                  cpu: "2"
                  memory: "20G"
                requests:
                  cpu: "2"
                  memory: "20G"
    workerGroupSpecs:
      - groupName: inf2-worker-group
        replicas: 2
        minReplicas: 2
        maxReplicas: 8
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image:
                imagePullPolicy: Always # Ensure the image is always pulled when updated
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: "90"
                    memory: "360G"
                    aws.amazon.com/neuron: "6"
                  requests:
                    cpu: "90"
                    memory: "360G"
                    aws.amazon.com/neuron: "6" # All Neuron cores of inf2.24xlarge
```
---
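By my back-of-envelope math, the log is at least self-consistent, assuming each inf2.24xlarge node gives Ray 90 CPUs and 12 neuron_cores (6 Inferentia2 devices × 2 NeuronCores each; the per-node numbers are my assumption):

```python
# Capacity math under my assumptions: each worker exposes 90 CPUs and
# 12 neuron_cores; each Serve replica needs 10 CPUs and 2 neuron_cores.
per_replica = {"CPU": 10, "neuron_cores": 2}
per_worker = {"CPU": 90, "neuron_cores": 12}

# neuron_cores is the binding resource: 6 replicas fit per worker.
replicas_per_worker = min(per_worker[r] // per_replica[r] for r in per_replica)

min_replicas = 15
placed = 2 * replicas_per_worker                           # 12 on the 2 running workers
pending = min_replicas - placed                            # 3 pending, matching the log
workers_needed = -(-min_replicas // replicas_per_worker)   # ceil(15 / 6) = 3
print(replicas_per_worker, pending, workers_needed)        # 6 3 3
```

So I would expect the autoscaler to request a third worker, not to report that no node type can satisfy the demand.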
As a note, I'm using Karpenter as my cluster autoscaling solution on EKS. The problem is that even when I increase minReplicas, the Ray actors stay in the "Pending" state with the error "The autoscaler could not find a node type to satisfy the request: [{'CPU': 10.0, 'neuron_cores': 2.0}, {'CPU': 10.0, 'neuron_cores': 2.0}, {'CPU': 10.0, 'neuron_cores': 2.0}]. Please specify a node type with the necessary resources." Shouldn't Ray scale the worker group replicas up once it sees the Ray actors/pods in the pending state?
Could anyone please suggest ideas?
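One idea I've been wondering about: since neuron_cores is a custom logical resource, does the worker group need to advertise it explicitly in rayStartParams so the autoscaler can map the pending demand onto this group, along the lines of the KubeRay custom-resources pattern? A sketch (the value 12 is my assumption for 6 devices × 2 NeuronCores on inf2.24xlarge):

```yaml
workerGroupSpecs:
- groupName: inf2-worker-group
  # ... same as above ...
  rayStartParams:
    # Advertise the custom logical resource so the autoscaler's
    # resource_demand_scheduler knows this group can satisfy
    # pending {"neuron_cores": ...} requests.
    resources: '"{\"neuron_cores\": 12}"'
```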