Autoscaling Ray Serve pods in k8s keep getting terminated and restarted

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I want to run a relatively small transformers model on GPUs in GCP on a k8s cluster. For this, I have written a Ray Serve service that works both locally and in the k8s cluster. The GCP k8s cluster has up to 12 GPU nodes available, each with 1 T4 GPU and 4 CPUs, and node autoscaling is enabled as well. Creating a GPU-based worker pod causes the GCP autoscaler to scale the GPU nodes up from 0; this upscaling takes a few minutes. After applying my k8s YAML manifest, the service starts and works as expected.

But as soon as I run a load test, I get problems. The Ray cluster tries to autoscale, but keeps emitting log messages like these:

INFO 2023-08-29 05:00:44,015 controller 257 deployment_state.py:1725 - Replica app1_APIIngress#AXsiNL started successfully on node 2e9f3ae0d98736ca8ec38c1c3924708a6098367fe42d434e8471c3fe.
INFO 2023-08-29 05:01:03,930 controller 257 http_state.py:436 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' on node 'a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e' listening on '0.0.0.0:8000'
WARNING 2023-08-29 05:01:10,889 controller 257 deployment_state.py:1902 - Deployment app1_LanguageDetectionModel has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow __init__ or reconfigure method.
INFO 2023-08-29 05:01:18,785 controller 257 deployment_state.py:1725 - Replica app1_LanguageDetectionModel#WvLDiy started successfully on node a24da4fba7b62a2283cdee682caf117149571f57f31bd6256b816b8e.
INFO 2023-08-29 05:02:29,342 controller 257 deployment_state.py:1396 - Autoscaling deployment app1_LanguageDetectionModel replicas from 1 to 4. Current ongoing requests: [5.333333333333333], current handle queued queries: 0.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:1571 - Adding 3 replicas to deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,344 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#WVBenk for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,366 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#bsytyf for deployment app1_LanguageDetectionModel.
INFO 2023-08-29 05:02:29,371 controller 257 deployment_state.py:353 - Starting replica app1_LanguageDetectionModel#gKCiKi for deployment app1_LanguageDetectionModel.
WARNING 2023-08-29 05:02:59,385 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.
WARNING 2023-08-29 05:03:29,409 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.
WARNING 2023-08-29 05:03:59,478 controller 257 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 3 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 4.0, "GPU": 1.0, "accelerator_type:T4": 0.001}, resources available: {"accelerator_type:T4": 0.999, "CPU": 3.0}.

k8s then tries to start a second Ray worker and head. The already working Ray cluster head is terminated in k8s and the new one becomes available. But I have not yet managed to run more than one worker replica at the same time to distribute the test requests across.
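As a sanity check, a snippet like the one below (a generic debugging sketch, not part of my service code) can be run from the head pod to compare the resources Ray has registered with what is currently free:

import ray

# Attach to the already running Ray cluster from inside one of its pods.
ray.init(address="auto")

# Total resources Ray has registered vs. resources currently free across the
# cluster. If the free CPU count never reaches a replica's request (here 4
# CPUs per replica), the replica stays pending no matter how long Serve waits.
print("cluster resources:  ", ray.cluster_resources())
print("available resources:", ray.available_resources())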

My k8s YAML manifest currently looks like this:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: language-detection
spec:
  serviceUnhealthySecondThreshold: 900 
  deploymentUnhealthySecondThreshold: 300 
  serveConfigV2: |
    applications:

    - name: app1
      route_prefix: /
      import_path: service:entrypoint
      deployments:
      - name: LanguageDetectionModel
        autoscaling_config:
          min_replicas: 1
          initial_replicas: null
          max_replicas: 4
          target_num_ongoing_requests_per_replica: 1.0
          metrics_interval_s: 10.0
          look_back_period_s: 30.0
          smoothing_factor: 1.0
          downscale_delay_s: 900.0
          upscale_delay_s: 10.0
        health_check_period_s: 10.0
        health_check_timeout_s: 900.0
        ray_actor_options:
          num_cpus: 4.0
          num_gpus: 1
          accelerator_type: T4

      - name: APIIngress
        num_replicas: 1

  rayClusterConfig:
    rayVersion: "2.6.1" 
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 900
      resources:
        limits:
          cpu: 1
          memory: 1G
        requests:
          cpu: 1
          memory: 1G
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image:  #extended rayproject/ray:2.6.1-gpu image with custom code added
              resources:
                limits:
                  cpu: 4
                  memory: 12G
                requests:
                  cpu: 4
                  memory: 12G
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
            containers:
              - name: language-detection-worker 
                image: #extended rayproject/ray:2.6.1-gpu image with custom code added
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1

Can anyone explain what I’m currently doing wrong and why the Ray autoscaler is not working as intended?

The Ray autoscaler can now successfully scale up to two workers in k8s when I change the replica requirement from 4 CPUs to 1, as in this YAML:

apiVersion: ray.io/v1alpha1
kind: RayService
metadata:
  name: language-detection
spec:
  serviceUnhealthySecondThreshold: 900 
  deploymentUnhealthySecondThreshold: 300 
  serveConfigV2: |
    applications:

    - name: app1
      route_prefix: /
      import_path: service:entrypoint
      deployments:
      - name: LanguageDetectionModel
        ray_actor_options:
          num_cpus: 1.0
          num_gpus: 1.0

      - name: APIIngress
        num_replicas: 1

  rayClusterConfig:
    rayVersion: "2.6.1" 
    enableInTreeAutoscaling: true
    autoscalerOptions:
      upscalingMode: Default
      idleTimeoutSeconds: 900
      resources:
        limits:
          cpu: 1
          memory: 1G
        requests:
          cpu: 1
          memory: 1G
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
            - name: ray-head
              image: #extended rayproject/ray:2.6.1-gpu image with custom code added
              resources:
                limits:
                  cpu: 4
                  memory: 12G
                requests:
                  cpu: 4
                  memory: 12G
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
    workerGroupSpecs:
      - replicas: 1
        minReplicas: 1
        maxReplicas: 5
        groupName: gpu-group
        rayStartParams: {}
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-accelerator: "nvidia-tesla-t4"
            containers:
              - name: language-detection-worker 
                image:  #extended rayproject/ray:2.6.1-gpu image with custom code added
                lifecycle:
                  preStop:
                    exec:
                      command: ["/bin/sh", "-c", "ray stop"]
                resources:
                  limits:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1
                  requests:
                    cpu: 4
                    memory: 8G
                    nvidia.com/gpu: 1

But it still fails when trying to start the third replica. I was able to capture the logs just before the Ray head was terminated:

WARNING 2023-08-29 10:11:54,875 controller 268 deployment_state.py:1882 - Deployment "app1_LanguageDetectionModel" has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {"CPU": 1.0, "GPU": 1.0}, resources available: {"CPU": 6.0}.
INFO 2023-08-29 10:12:07,073 controller 268 http_state.py:436 - Starting HTTP proxy with name 'SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3' on node '98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3' listening on '0.0.0.0:8000'
INFO 2023-08-29 10:12:21,796 controller 268 deployment_state.py:1725 - Replica app1_LanguageDetectionModel#zpkwZj started successfully on node 98ab7472fb6b3619acf72050238d22e30039cfed4a4dfa9ed1105eb3.
WARNING 2023-08-29 10:19:10,379 controller 268 http_state.py:197 - Health check for HTTP proxy SERVE_CONTROLLER_ACTOR:SERVE_PROXY_ACTOR-e23c3393482dc53b820972145d21bd236e86f4105205d1bb60f0a26a failed: The actor died unexpectedly before finishing this task.
WARNING 2023-08-29 10:19:11,522 controller 268 deployment_state.py:735 - Actor for replica app1_LanguageDetectionModel#zpkwZj crashed, marking it unhealthy immediately.
WARNING 2023-08-29 10:19:11,522 controller 268 deployment_state.py:1812 - Replica app1_LanguageDetectionModel#zpkwZj of deployment app1_LanguageDetectionModel failed health check, stopping it.
INFO 2023-08-29 10:19:11,523 controller 268 deployment_state.py:892 - Stopping replica app1_LanguageDetectionModel#zpkwZj for deployment app1_LanguageDetectionModel.
ERROR 2023-08-29 10:19:11,535 controller 268 deployment_state.py:617 - Exception when trying to gracefully shutdown replica:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_stopped
    ray.get(self._graceful_shutdown_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/serve/_private/deployment_state.py", line 615, in check_stopped
    ray.get(self._graceful_shutdown_ref)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2495, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

I’m always sending the same test load data, so I don’t understand why the Ray actor just dies without explanation.
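For reference, the dead actors can also be listed via the Ray state API (Ray 2.x); the snippet below is only an illustrative sketch, not code from my service:

import ray
from ray.util.state import list_actors

# Attach to the already running cluster (run e.g. inside the head pod).
ray.init(address="auto")

# List actors that are no longer alive. Cross-referencing their IDs with the
# Serve controller log and with kubectl events/describe output helps to tell
# whether a replica was OOM-killed, evicted, or lost together with its node.
for actor in list_actors(filters=[("state", "=", "DEAD")], limit=100):
    print(actor)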

Anyone on the Serve team: @shrekris @Kai-Hsun_Chen

I solved it myself. There were two errors on my side:

  • I forgot to build and push a new version of my custom Docker image after changing the autoscaler values in the Ray Serve Python code. This led to a mismatch between the values in the Python script, the values in the k8s deployment file, and the resource values I expected (see the sketch below for how the decorator values and serveConfigV2 relate).
  • The k8s node pools were still too small; I needed to add a beefier node pool with more CPUs and memory to satisfy all resource requests from the Ray workers.
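For context, here is a minimal, hypothetical sketch of what a service:entrypoint script for this kind of setup could look like; the model name, route, and handler methods are placeholders rather than my actual code. The relevant point is that the autoscaling_config and ray_actor_options baked into the image via the decorator describe the same deployment that serveConfigV2 configures, so the image has to be rebuilt whenever those values change:

# service.py -- hypothetical sketch, not the actual file from this thread.
from fastapi import FastAPI

from ray import serve
from transformers import pipeline

app = FastAPI()


@serve.deployment(
    ray_actor_options={"num_cpus": 1.0, "num_gpus": 1.0},
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_num_ongoing_requests_per_replica": 1.0,
    },
)
class LanguageDetectionModel:
    def __init__(self):
        # Placeholder model; the real deployment loads its own checkpoint.
        self._classifier = pipeline(
            "text-classification",
            model="papluca/xlm-roberta-base-language-detection",
            device=0,  # run on the GPU granted by num_gpus=1
        )

    def detect(self, text: str) -> dict:
        return self._classifier(text)[0]


@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, model_handle):
        self._model = model_handle

    @app.get("/")
    async def detect(self, text: str) -> dict:
        # In Ray 2.6 the injected handle's .remote() returns an ObjectRef,
        # which is awaited again to get the actual result.
        ref = await self._model.detect.remote(text)
        return await ref


# import_path "service:entrypoint" in serveConfigV2 points at this object.
entrypoint = APIIngress.bind(LanguageDetectionModel.bind())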

Can you please share your Python file?