How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I want to run a relatively small transformers model on GPUs in GCP on a k8s cluster. For this, I have written a Ray Serve service that works locally as well as in the k8s cluster. The GCP k8s cluster has up to 12 GPU nodes available, with 1 T4 GPU and 4 CPUs per node, and autoscaling is enabled. Creating a GPU-based worker pod causes the GCP autoscaler to scale the GPU nodes up from 0; this upscaling takes a few minutes. After applying my k8s YAML manifest, the service starts and works as expected.
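For context, the Serve app behind `service:entrypoint` is structured roughly like this (a simplified sketch, not the exact code; the model name, request format, and handler logic are placeholders):

```python
# service.py -- simplified sketch of the Serve app; the model name, request
# format, and handler logic are placeholders, not the exact production code
from ray import serve
from starlette.requests import Request


@serve.deployment  # real options (autoscaling, resources) come from serveConfigV2
class LanguageDetectionModel:
    def __init__(self):
        from transformers import pipeline
        # hypothetical model choice; any small transformers model fits here
        self.pipe = pipeline(
            "text-classification",
            model="papluca/xlm-roberta-base-language-detection",
            device=0,  # the single T4 GPU assigned to this replica
        )

    def detect(self, text: str) -> str:
        return self.pipe(text)[0]["label"]


@serve.deployment
class APIIngress:
    def __init__(self, model_handle):
        self.model = model_handle

    async def __call__(self, request: Request) -> str:
        text = (await request.json())["text"]
        ref = await self.model.detect.remote(text)  # Ray 2.6 handle API
        return await ref


entrypoint = APIIngress.bind(LanguageDetectionModel.bind())
```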
But as soon as I start load testing, I run into problems. The Ray cluster tries to autoscale but keeps emitting log messages like these:
```
INFO 2023-08-29 05:00:44,015 controller 257 deployment_state.py:1725 - Replica app1_APIIngress#AXsiNL started successfully on node 2e9f3ae0d98736ca8ec38c1c3924708a6098367fe42d434e8471c3fe.
INFO 2023-08-29 05:01:03,930 controller 257 http_state.py:436 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' on node 'a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' listening on '0.0.0.0:8000'
WARNING 2023-08-29 05:01:10,889 controller 257 deployment_state.py:1902 - Deployment app1_LanguageDetectionModel has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
INFO 2023-08-29 05:01:18,785 controller 257 deployment_state.py:1725 - Replica app1_LanguageDetectionModel#WvLDiy started successfully on node a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e.
INFO 2023-08-29 05:02:29,342 controller 257 deployment_state.py:1396 - Autoscaling deployment app1_LanguageDetectionModel replicas from 1 to 4. Current ongoing requests: [5.333333333333333], current handle queued queries: 0.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:1571 - Adding 3 replicas to deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#WVBenk for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,366 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#bsytyf for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,371 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#gKCiKi for deployment app1_LanguageDetectionModel.
WARNING 2023-08-29 05:02:59,385 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.
```

The last warning then repeats every 30 seconds with the same required/available resources.
Kubernetes then tries to start a second Ray worker and a second head. The previously working Ray head pod is terminated and the new one takes over. So far I have not managed to get more than one worker replica running at the same time to distribute the test requests across.
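The load test itself is nothing fancy; a simplified stand-in for what I run looks like this (the URL and request payload are placeholders for the actual k8s service endpoint and request schema):

```python
# load_test.py -- simplified stand-in for the load test; URL and payload are
# placeholders for the actual k8s service endpoint and request schema
import concurrent.futures
import requests

URL = "http://<serve-endpoint>:8000/"  # placeholder; the Serve HTTP proxy port

def call(i: int) -> int:
    resp = requests.post(URL, json={"text": f"sample text {i}"})
    return resp.status_code

# Fire enough concurrent requests to push ongoing requests per replica above 1,
# which should trigger Serve's autoscaler.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(call, range(1000)))
print({code: codes.count(code) for code in set(codes)})
```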
My k8s YAML manifest currently looks like this:
```yaml
apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: language-detection
spec:
  serviceUnhealthySecondThreshold: 900
  deploymentUnhealthySecondThreshold: 300
  serveConfigV2: |
    applications:
      - name: app1
        route_prefix: /
        import_path: service:entrypoint
        deployments:
          - name: LanguageDetectionModel
            autoscaling_config:
              min_replicas: 1
              initial_replicas: null
              max_replicas: 4
              target_num_ongoing_requests_per_replica: 1.0
              metrics_interval_s: 10.0
              look_back_period_s: 30.0
              smoothing_factor: 1.0
              downscale_delay_s: 900.0
              upscale_delay_s: 10.0
            health_check_period_s: 10.0
            health_check_timeout_s: 900.0
            ray_actor_options:
              num_cpus: 4.0
              num_gpus: 1
              accelerator_type: T4
          - name: APIIngress
            num_replicas: 1
  rayClusterConfig:
    rayVersion: "2.6.1"
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 900
      resources:
        limits:
          cpu: 1
          memory: 1G
        requests:
          cpu: 1
          memory: 1G
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: # extended rayproject/ray:2.6.1-gpu image with custom code added
              resources:
                limits:
                  cpu: 4
                  memory: 12G
                requests:
                  cpu: 4
                  memory: 12G
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
            containers:
              - name: language-detection-worker
                image: # extended rayproject/ray:2.6.1-gpu image with custom code added
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
```
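For completeness, my understanding is that the `serveConfigV2` deployment options above would map onto the `@serve.deployment` decorator like this, if configured in code instead of the manifest (a sketch only; I set everything via `serveConfigV2`):

```python
# Sketch: the serveConfigV2 options for LanguageDetectionModel expressed as
# decorator arguments (equivalent configuration path in Ray Serve 2.6).
from ray import serve

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_num_ongoing_requests_per_replica": 1.0,
        "metrics_interval_s": 10.0,
        "look_back_period_s": 30.0,
        "smoothing_factor": 1.0,
        "downscale_delay_s": 900.0,
        "upscale_delay_s": 10.0,
    },
    health_check_period_s=10.0,
    health_check_timeout_s=900.0,
    ray_actor_options={"num_cpus": 4.0, "num_gpus": 1, "accelerator_type": "T4"},
)
class LanguageDetectionModel:
    ...
```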
Can anyone explain what I'm doing wrong and why the Ray autoscaler is not working as intended?