Unable to recover from head-pod failure in k8s

Hi!

We have Ray 1.10.0 running in a Kubernetes cluster and it works as expected. But with our current PROD configuration, when the head pod is “terminated” by k8s, the system is not able to recover from the failure. We made a few changes in our DEV environment:

  • Added an external Redis.
  • Set restartPolicy: Always for the head pod.
  • Modified head_start_ray_commands (since that is where ray start is invoked) to:
    • Copy ray_bootstrap_config.yaml to a mounted volume (otherwise this file is lost when the pod restarts).
    • Clear the Redis content.
    • Start the Ray head.
    • Launch our basic deployments.
  • Added a lifecycle -> postStart -> exec -> command script that detects when the pod has been restarted and, if so, relaunches Ray as head with ray start --head (see the sketch after this list).
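
For reference, the relevant parts of the head pod spec looked roughly like this (a minimal sketch, not our exact manifest; the volume name, mount path, and relaunch-head.sh script are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: ray-head
spec:
  restartPolicy: Always                 # restart the container instead of leaving the head dead
  volumes:
    - name: temporal
      emptyDir: {}                      # survives container restarts within the same pod
  containers:
    - name: ray-node
      image: rayproject/ray:1.10.0
      command: ["/bin/bash", "-c", "--"]
      args: ["trap : TERM INT; sleep infinity & wait;"]
      volumeMounts:
        - name: temporal
          mountPath: /home/ray/temporal # keeps ray_bootstrap_config.yaml across restarts
      lifecycle:
        postStart:
          exec:
            # illustrative script: detects a restart and re-runs `ray start --head` if needed
            command: ["/bin/bash", "-c", "/home/ray/relaunch-head.sh"]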

In our test we launch our full platform, which has several deployments, with one GPU pod and 2 CPU worker nodes. When we force the pod to restart (by killing the sleep infinity process) we have tried different approaches, but we cannot get Ray to recover properly.

On restart we tried:

  • ray start --head --autoscaling-config=/home/ray/temporal/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --address=redis-service.ray.svc.cluster.local:6379 --redis-password=foobared
  • ray start --head --dashboard-host 0.0.0.0 --address=redis-service.ray.svc.cluster.local:6379 --redis-password=foobared

We saw:

  • The web dashboard does not work: react-dom.production.min.js:209 TypeError: Cannot read properties of undefined (reading 'length')
  • Several errors/warnings can be found in the logs:
    • (scheduler +14s) Warning: The following resource request cannot be scheduled right now: {'GPU': 0.2, 'CPU': 0.1, 'memory': 4194304000.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
    • core_worker.h:964: Mismatched WorkerID: ignoring RPC for previous worker 3e6943d954d805c5086d186d55cace882585db3522d719b492af2a8b, current worker ID: cff9688adf4d92c38cdf49a6a6ef8bab1ef54f80971beaf022eedd3a
    • Deployment 'search-resource' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 0.2}, resources available: {'CPU': 6.7}. component=serve deployment=search-resource - cannot allocate deployments.
  • Ray tries to redeploy all previous deployments, but the worker nodes weren’t restarted.
  • If we ask Ray to execute something (e.g. ray.init()), the process never finishes; only logs like the ones below are displayed.

And, in particular, without --autoscaling-config we see:

(ingestion-helper pid=191, ip=10.92.6.3) 2022-03-03 07:09:06,381        ERROR gcs_utils.py:142 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
(ingestion-helper pid=191, ip=10.92.6.3)        status = StatusCode.UNAVAILABLE
(ingestion-helper pid=191, ip=10.92.6.3)        details = "failed to connect to all addresses"
(ingestion-helper pid=191, ip=10.92.6.3)        debug_error_string = "{"created":"@1646320146.381345389","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1646320146.381343968","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
(ingestion-helper pid=191, ip=10.92.6.3) > component=serve deployment=ingestion-helper replica=ingestion-helper#jawqlL

Are we doing something wrong? How should we configure/start Ray to recover from a head failure?

Thanks in advance

  1. Failure to restart the head is a bug. To figure out the issue, it would be good to see operator logs from around the time of the head failure (for a cluster running without the dev changes described above).
  2. You may want to look into the KubeRay operator (GitHub - ray-project/kuberay: A toolkit to run Ray applications on Kubernetes) for a more stable and scalable deployment solution; a minimal example manifest is sketched below. We’re currently working on integrating the full Ray feature set (most notably, autoscaling) with KubeRay.
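
A minimal RayCluster custom resource for the KubeRay operator looks roughly like this (a sketch against the v1alpha1 CRD; the names, image tag, and replica counts are illustrative):

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: example-cluster
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:1.10.0
  workerGroupSpecs:
    - groupName: cpu-workers
      replicas: 2
      minReplicas: 2
      maxReplicas: 4
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: rayproject/ray:1.10.0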

Today we started testing KubeRay. Our plan is to make a production deployment soon; how stable is the current KubeRay project for production environments?

Another question: we are working with Ray 1.10.0, but the KubeRay examples specify 1.8.0. Is the current version of KubeRay valid for Ray 1.10?

Yep, KubeRay is compatible with Ray 1.10.0.

We are currently moving from a YAML similar to ray/example-full.yaml at master · ray-project/ray · GitHub, with heterogeneous workers, to KubeRay.

In our configuration we have a set of custom resources to prevent some kinds of deployments/actors from being executed on some workers.

Configuration excerpt:

available_node_types:
  worker_node:
    # Minimum number of Ray workers of this Pod type.
    min_workers: 2
    # Maximum number of Ray workers of this Pod type. Takes precedence over min_workers.
    max_workers: 4
    # User-specified custom resources for use by Ray. Object with string keys and integer values.
    # (Ray detects CPU and GPU from pod spec resource requests and limits, so no need to fill those here.)
    # resources: {"example-resource-a": 1, "example-resource-b": 2}
    resources: { "only_cpu": 9001 }

We tried different approaches, but the pods never start properly. We also found and commented on an issue on GitHub: [Feature] Users should be able to define custom resources in worker groups · Issue #167 · ray-project/kuberay · GitHub

In the commit there is the following comment:

      # Use `resources` to optionally specify custom resource annotations for the Ray node.
      # The value of `resources` is a string-integer mapping.
      # Currently, `resources` must be provided in the unfortunate format demonstrated below.

In the example I didn’t find the demonstration. Can you provide us with an example of how to configure this?

We also tried different configurations, but none of them worked.

I think this is a great question for the #KubeRay channel in the Ray Slack or for an issue in the KubeRay GitHub repo.

Eugene helped us move forward on this.

The solution was:

resources: "'{\"customRes\": 1, \"anotherOne\": 2}'"
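
In context, this goes under rayStartParams for the worker group, e.g. (a sketch; the group name, image, and replica counts are illustrative, and the resource name is taken from our configuration above):

workerGroupSpecs:
  - groupName: only-cpu-workers
    replicas: 2
    minReplicas: 2
    maxReplicas: 4
    rayStartParams:
      # a JSON mapping, escaped and wrapped in single quotes, in the currently required format
      resources: "'{\"only_cpu\": 9001}'"
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:1.10.0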

For this scenario, maybe the resources attribute specification should accept a map, something like:

rayStartParams:
  # ...
  resources: 
    customRes: 1
    anotherOne: 2

Yes, fixing that particular detail is definitely on our to-do list.

A few days ago, we updated to Ray 1.11.0 and used the namespaced operator (see Ray Operator Advanced Configuration — Ray 1.11.0).

In case this helps anybody, here is a quick summary.

We also created a bash script and added it to our custom Docker image in order to launch our infrastructure on every head (re)start. In the configuration of the Ray cluster we have:

  headStartRayCommands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --no-monitor --dashboard-host 0.0.0.0 &> /tmp/raylogs
    - /home/ray/start-head-lite.sh

Where start-head-lite.sh contains:

#!/bin/bash

log_path="/home/ray/"
log_file="${log_path}$(date +"%Y_%m_%d_%I_%M_%S")_relaunch.log"

# headStartRayCommands can run twice when the head pod restarts, so a pid file
# (in the current working directory) guards against launching the process twice.
if [[ ! -f start-head-lite.pid ]]; then
  echo $$ > start-head-lite.pid
  nohup python /home/ray/utils/head-utils/startup-process.py >> "$log_file" 2>&1 &
else
  echo 'Process already launched' >> "$log_file"
fi

We noticed that when the head pod is restarted (e.g. by deleting the head pod), the headStartRayCommands configuration seems to be executed twice in the new head pod. That is the reason we added the start-head-lite.pid guard.