Hi!
We are running Ray 1.10.0 in a k8s cluster and it works as expected. However, with our current PROD configuration, when the head pod is "terminated" by k8s the system is not able to recover from the failure. We made a few changes in our DEV environment:
- Added an external Redis.
- Set `restartPolicy: Always` for the head pod.
- Modified `head_start_ray_commands` (because it is only invoked by `ray start`) to:
  - Copy `ray_bootstrap_config.yaml` to a volume that survives the restart (otherwise this file is lost when the pod restarts).
  - Clear the Redis content.
  - Start the Ray head.
  - Launch our basic deployments.
- Added a `lifecycle -> postStart -> exec -> command` script that detects when the pod has been restarted and, if so, relaunches Ray as head with `ray start --head` (a rough sketch of this script is included below).
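For reference, the postStart script does roughly the following (a simplified sketch; the marker file, the mount path `/home/ray/temporal`, and the Redis service name/password are placeholders from our DEV setup):

```bash
#!/bin/bash
# Sketch of the lifecycle postStart hook on the head pod.
# /home/ray/temporal is the volume that survives the restart; names and paths are placeholders.
MARKER=/home/ray/temporal/head_started

if [ -f "$MARKER" ]; then
  # The marker is already there, so the container was restarted by k8s:
  # relaunch Ray as the head node against the external Redis.
  ray start --head \
    --dashboard-host 0.0.0.0 \
    --address=redis-service.ray.svc.cluster.local:6379 \
    --redis-password=foobared
else
  # First start: head_start_ray_commands already started Ray, so just keep a copy
  # of the bootstrap config on the surviving volume and drop the marker file.
  cp /home/ray/ray_bootstrap_config.yaml /home/ray/temporal/ray_bootstrap_config.yaml
  touch "$MARKER"
fi
```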
In our test we launch our full platform, which has several deployments, with one GPU pod and 2 CPU worker nodes. When we force the head pod to restart (by killing the `sleep infinity` process) we have tried different approaches, but we cannot get Ray to recover properly.
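We force the restart with something like the following (namespace and pod name are placeholders for our DEV cluster):

```bash
# Kill the container's main process ("sleep infinity") so that Kubernetes
# restarts the head container (restartPolicy: Always).
kubectl -n ray exec ray-head-xxxxx -- pkill -f "sleep infinity"
```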
During the restart we tried:
ray start --head --autoscaling-config=/home/ray/temporal/ray_bootstrap_config.yaml --dashboard-host 0.0.0.0 --address=redis-service.ray.svc.cluster.local:6379 --redis-password=foobared
ray start --head --dashboard-host 0.0.0.0 --address=redis-service.ray.svc.cluster.local:6379 --redis-password=foobared
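As a sanity check we also verify that the external Redis is reachable from the head pod before starting Ray (assuming `redis-cli` is available in the image; service name and password are the same placeholders as above):

```bash
# Verify the external Redis answers before relaunching the head.
redis-cli -h redis-service.ray.svc.cluster.local -p 6379 -a foobared ping   # expected output: PONG
```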
We saw:
- The web dashboard does not work:
react-dom.production.min.js:209 TypeError: Cannot read properties of undefined (reading 'length')
- Several errors/warnings appear in the logs:
(scheduler +14s) Warning: The following resource request cannot be scheduled right now: {'GPU': 0.2, 'CPU': 0.1, 'memory': 4194304000.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
core_worker.h:964: Mismatched WorkerID: ignoring RPC for previous worker 3e6943d954d805c5086d186d55cace882585db3522d719b492af2a8b, current worker ID: cff9688adf4d92c38cdf49a6a6ef8bab1ef54f80971beaf022eedd3a
- Deployment 'search-resource' has 1 replicas that have taken more than 30s to be scheduled. This may be caused by waiting for the cluster to auto-scale, or waiting for a runtime environment to install. Resources required for each replica: {'CPU': 0.2}, resources available: {'CPU': 6.7}. component=serve deployment=search-resource
- Ray cannot allocate the deployments.
- Ray tries to deploy all previous deployments, but the worker nodes weren't restarted.
- If we ask Ray to execute something (e.g. a script calling `ray.init()`), the process never finishes; only logs like the ones below are displayed.
In particular, without `--autoscaling-config` we see:
(ingestion-helper pid=191, ip=10.92.6.3) 2022-03-03 07:09:06,381 ERROR gcs_utils.py:142 -- Failed to send request to gcs, reconnecting. Error <_InactiveRpcError of RPC that terminated with:
(ingestion-helper pid=191, ip=10.92.6.3) status = StatusCode.UNAVAILABLE
(ingestion-helper pid=191, ip=10.92.6.3) details = "failed to connect to all addresses"
(ingestion-helper pid=191, ip=10.92.6.3) debug_error_string = "{"created":"@1646320146.381345389","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3134,"referenced_errors":[{"created":"@1646320146.381343968","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":163,"grpc_status":14}]}"
(ingestion-helper pid=191, ip=10.92.6.3) > component=serve deployment=ingestion-helper replica=ingestion-helper#jawqlL
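For the "failed to connect to all addresses" errors, the only extra check we have done so far is confirming basic TCP connectivity from a worker pod to the external Redis address we pass to `ray start` (pod and service names are placeholders):

```bash
# From a worker pod: check that the Redis address used as --address is
# resolvable and reachable after the head restart.
kubectl -n ray exec ray-worker-cpu-xxxxx -- python -c \
  "import socket; socket.create_connection(('redis-service.ray.svc.cluster.local', 6379), timeout=5); print('redis reachable')"
```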
Are we doing something wrong? How should we configure/start Ray so that it recovers from a head failure?
Thanks in advance