Autoscaling Ray Serve pods in k8s keep getting terminated and restarted

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I want to run a relatively small transformers model on GPUs in GCP on a k8s cluster. For this, I have written a Ray Serve service that works both locally and in the k8s cluster. The GCP k8s cluster has up to 12 GPU nodes available, each with 1 T4 GPU and 4 CPUs, and node autoscaling is enabled as well. Creating a GPU-based worker pod causes the GCP autoscaler to scale the GPU nodes up from 0; this upscaling takes a few minutes. After applying my k8s YAML manifest, the service starts and works as expected.

But as soon as I run a load test, I get problems. The Ray cluster tries to autoscale, but keeps emitting log messages like these:

INFO 2023-08-29 05:00:44,015 controller 257 deployment_state.py:1725 - Replica app1_APIIngress#AXsiNL started successfully on node 2e9f3ae0d98736ca8ec38c1c3924708a6098367fe42d434e8471c3fe.
INFO 2023-08-29 05:01:03,930 controller 257 http_state.py:436 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' on node 'a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' listening on '0.0.0.0:8000'
WARNING 2023-08-29 05:01:10,889 controller 257 deployment_state.py:1902 - Deployment app1_LanguageDetectionModel has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
INFO 2023-08-29 05:01:18,785 controller 257 deployment_state.py:1725 - Replica app1_LanguageDetectionModel#WvLDiy started successfully on node a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e.
INFO 2023-08-29 05:02:29,342 controller 257 deployment_state.py:1396 - Autoscaling deployment app1_LanguageDetectionModel replicas from 1 to 4. Current ongoing requests: [5.333333333333333], current handle queued queries: 0.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:1571 - Adding 3 replicas to deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#WVBenk for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,366 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#bsytyf for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,371 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#gKCiKi for deployment app1_LanguageDetectionModel.
WARNING 2023-08-29 05:02:59,385 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.
WARNING 2023-08-29 05:03:29,409 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.
WARNING 2023-08-29 05:03:59,478 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.

k8s then tries to start a second Ray worker and head. The already working Ray cluster head is terminated in k8s and the new one becomes available. But I have not yet managed to run more than one worker replica at the same time to distribute the test requests across.
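As a sanity check, a snippet like the one below (a generic debugging sketch, not part of my service code) can be run from the head pod to compare the resources Ray has registered with what is currently free:

import ray

# Attach to the already running Ray cluster from inside one of its pods.
ray.init(address="auto")

# Total resources Ray has registered vs. resources currently free across the
# cluster. If the free CPU count never reaches a replica's request (here 4
# CPUs per replica), the replica stays pending no matter how long Serve waits.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())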

My k8s YAML manifest currently looks like this:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: language-detection
spec:
  serviceUnhealthySecondThreshold: 900 
  deploymentUnhealthySecondThreshold: 300 
  serveConfigV2: |
    applications:

    - name: app1
      route_prefix: /
      import_path: service:entrypoint
      deployments:
      - name: LanguageDetectionModel
        autoscaling_config:
          min_replicas: 1
          initial_replicas: null
          max_replicas: 4
          target_num_ongoing_requests_per_replica: 1.0
          metrics_interval_s: 10.0
          look_back_period_s: 30.0
          smoothing_factor: 1.0
          downscale_delay_s: 900.0
          upscale_delay_s: 10.0
        health_check_period_s: 10.0
        health_check_timeout_s: 900.0
        ray_actor_options:
          num_cpus: 4.0
          num_gpus: 1
          accelerator_type: T4

      - name: APIIngress
        num_replicas: 1

  rayClusterConfig:
    rayVersion: "2.6.1" 
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 900
      resources:
        limits:
          cpu: 1
          memory: 1G
        requests:
          cpu: 1
          memory: 1G
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image:  #extended rayproject/ray:2.6.1-gpu image with custom code added
              resources:
                limits:
                  cpu: 4
                  memory: 12G
                requests:
                  cpu: 4
                  memory: 12G
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
            containers:
              - name: language-detection-worker 
                image: #extended rayproject/ray:2.6.1-gpu image with custom code added
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1

Can anyone explain what I’m currently doing wrong and why the Ray autoscaler is not working as intended?

The Ray autoscaler can now successfully scale up to two workers in k8s when I change the replica requirement from 4 CPUs to 1, as in this YAML:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: language-detection
spec:
  serviceUnhealthySecondThreshold: 900 
  deploymentUnhealthySecondThreshold: 300 
  serveConfigV2: |
    applications:

    - name: app1
      route_prefix: /
      import_path: service:entrypoint
      deployments:
      - name: LanguageDetectionModel
        ray_actor_options:
          num_cpus: 1.0
          num_gpus: 1.0

      - name: APIIngress
        num_replicas: 1

  rayClusterConfig:
    rayVersion: "2.6.1" 
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 900
      resources:
        limits:
          cpu: 1
          memory: 1G
        requests:
          cpu: 1
          memory: 1G
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: #extended rayproject/ray:2.6.1-gpu image with custom code added
              resources:
                limits:
                  cpu: 4
                  memory: 12G
                requests:
                  cpu: 4
                  memory: 12G
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
            containers:
              - name: language-detection-worker 
                image:  #extended rayproject/ray:2.6.1-gpu image with custom code added
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1

But it still fails when trying to start the third replica. I was able to capture the logs just before the Ray head was terminated:

WARNING 2023-08-29 10:11:54,875 controller 268 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 1.0, "GPU": 1.0}, resources available: {"CPU": 6.0}.
INFO 2023-08-29 10:12:07,073 controller 268 http_state.py:436 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3' on node '98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3' listening on '0.0.0.0:8000'
INFO 2023-08-29 10:12:21,796 controller 268 deployment_state.py:1725 - Replica app1_LanguageDetectionModel#zpkwZj started successfully on node 98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3.
WARNING 2023-08-29 10:19:10,379 controller 268 http_state.py:197 - Health check for HTTP proxy SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-e23c3393482dc53b820972145d21bd236e86f4105205d1bb60f0a26a failed: The actor died unexpectedly before finishing this task.
WARNING 2023-08-29 10:19:11,522 controller 268 deployment_state.py:735 - Actor for replica app1_LanguageDetectionModel#zpkwZj crashed, marking it unhealthy immediately.
WARNING 2023-08-29 10:19:11,522 controller 268 deployment_state.py:1812 - Replica app1_LanguageDetectionModel#zpkwZj of deployment app1_LanguageDetectionModel failed health check, stopping it.
INFO 2023-08-29 10:19:11,523 controller 268 deployment_state.py:892 - Stopping replica app1_LanguageDetectionModel#zpkwZj for deployment app1_LanguageDetectionModel.
ERROR 2023-08-29 10:19:11,535 controller 268 deployment_state.py:617 - Exception when trying to gracefully shutdown replica:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_stopped
    ray.get(self._graceful_shutdown_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_stopped
    ray.get(self._graceful_shutdown_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

I’m always sending the same test load data, so I don’t understand why the Ray actor just dies without explanation.
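For reference, the dead actors can also be listed via the Ray state API (Ray 2.x); the snippet below is only an illustrative sketch, not code from my service:

import ray
from ray.util.state import list_actors

# Attach to the already running cluster (run e.g. inside the head pod).
ray.init(address="auto")

# List actors that are no longer alive. Cross-referencing their IDs with the
# Serve controller log and with kubectl events/describe output helps to tell
# whether a replica was OOM-killed, evicted, or lost together with its node.
for actor in list_actors(filters=[("state", "=", "DEAD")], limit=100):
    print(actor)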

Anyone on the Serve team: @shrekris @Kai-Hsun_Chen

I solved it myself. There were two errors on my side:

  • I forgot to build and push a new version of my custom Docker image after changing the autoscaler values in the Ray Serve Python code. This led to a mismatch between the values in the Python script, the values in the k8s deployment file, and the resource values I expected (see the sketch below for how the decorator values and serveConfigV2 relate).
  • The k8s node pools were still too small; I needed to add a beefier node pool with more CPUs and memory to satisfy all resource requests from the Ray workers.
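For context, here is a minimal, hypothetical sketch of what a service:entrypoint script for this kind of setup could look like; the model name, route, and handler methods are placeholders rather than my actual code. The relevant point is that the autoscaling_config and ray_actor_options baked into the image via the decorator describe the same deployment that serveConfigV2 configures, so the image has to be rebuilt whenever those values change:

# service.py -- hypothetical sketch, not the actual file from this thread.
from fastapi import FastAPI

from ray import serve
from transformers import pipeline

app = FastAPI()


@serve.deployment(
    ray_actor_options={"num_cpus": 1.0, "num_gpus": 1.0},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_num_ongoing_requests_per_replica": 1.0,
    },
)
class LanguageDetectionModel:
    def __init__(self):
        # Placeholder model; the real deployment loads its own checkpoint.
        self._classifier = pipeline(
            "text-classification",
            model="papluca/xlm-roberta-base-language-detection",
            device=0,  # run on the GPU granted by num_gpus=1
        )

    def detect(self, text: str) -> dict:
        return self._classifier(text)[0]


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, model_handle):
        self._model = model_handle

    @app.get("/")
    async def detect(self, text: str) -> dict:
        # In Ray 2.6 the injected handle's .remote() returns an ObjectRef,
        # which is awaited again to get the actual result.
        ref = await self._model.detect.remote(text)
        return await ref


# import_path "service:entrypoint" in serveConfigV2 points at this object.
entrypoint = APIIngress.bind(LanguageDetectionModel.bind())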

Can you please share your Python file?