Replicas being force-killed and eventually stopped, blocking production deployment

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I am trying to build my app using the command "serve build main:query_events -o serve_config.yaml".

But in the logs I see that the replicas are being force-killed and never start. Here are the logs:

(ServeController pid=26592) INFO 2024-08-06 10:27:33,418 controller 26592 deployment_state.py:2182 - Replica(id='0i1mpnrx', deployment='query-events-deployment', app='default') is stopped.
(ServeReplica:default:query-events-deployment pid=820) INFO:botocore.credentials:Found credentials in environment variables.
(ServeReplica:default:query-events-deployment pid=820) INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
(ServeReplica:default:query-events-deployment pid=820) INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
(ServeReplica:default:query-events-deployment pid=820) WARNING 2024-08-06 10:27:46,279 default_query-events-deployment 2b1epayu api.py:419 - DeprecationWarning: route_prefix in @serve.deployment has been deprecated. To specify a route prefix for an application, pass it into serve.run instead.
(ServeReplica:default:query-events-deployment pid=820) WARNING 2024-08-06 10:27:46,289 default_query-events-deployment 2b1epayu api.py:432 - The default value for max_ongoing_requests has changed from 100 to 5 in Ray 2.32.0.
(ServeReplica:default:query-events-deployment pid=820) INFO:C:\Users\gupta shitul\AppData\Local\Programs\Python\Python310\lib\site-packages\ray\serve_private\api.py:Connecting to existing Serve app in namespace "serve". New http options will not be applied.
(ServeController pid=26592) INFO 2024-08-06 10:27:46,418 controller 26592 deployment_state.py:1598 - Deploying new version of Deployment(name='query-events-deployment', app='default') (initial target replicas: 1).
(ServeController pid=26592) INFO 2024-08-06 10:27:46,528 controller 26592 deployment_state.py:1721 - Stopping 1 replicas of Deployment(name='query-events-deployment', app='default') with outdated versions.
(ServeController pid=26592) INFO 2024-08-06 10:27:46,528 controller 26592 deployment_state.py:1844 - Adding 1 replica to Deployment(name='query-events-deployment', app='default').
(ServeController pid=26592) INFO 2024-08-06 10:27:46,693 controller 26592 deployment_state.py:1042 - Replica(id='2b1epayu', deployment='query-events-deployment', app='default') did not shut down after grace period, force-killing it.
(ServeController pid=26592) INFO 2024-08-06 10:27:46,802 controller 26592 deployment_state.py:2182 - Replica(id='2b1epayu', deployment='query-events-deployment', app='default') is stopped.
(ServeReplica:default:query-events-deployment pid=14304) INFO:botocore.credentials:Found credentials in environment variables.
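
To make the sequence easier to follow, the controller lines above can be grouped per replica ID. The following is only a rough sketch that assumes the log format shown in this post; the names replica_timeline, REPLICA_EVENT, and SAMPLE are made up for illustration and are not part of Ray.

```python
import re
from collections import defaultdict

# Assumption: controller lines look like the ones pasted in this post, i.e.
# "<timestamp> ... Replica(id='<id>', ...) <message>".
# The character class also accepts curly quotes, in case the forum
# software converted the straight quotes in the pasted output.
REPLICA_EVENT = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}).*?"
    r"Replica\(id=[‘’'\"]?(?P<rid>\w+)[‘’'\"]?.*?\)\s*(?P<msg>.+)$"
)

# Two controller lines copied from the logs in this post.
SAMPLE = """\
(ServeController pid=26592) INFO 2024-08-06 10:27:46,693 controller 26592 deployment_state.py:1042 - Replica(id='2b1epayu', deployment='query-events-deployment', app='default') did not shut down after grace period, force-killing it.
(ServeController pid=26592) INFO 2024-08-06 10:27:46,802 controller 26592 deployment_state.py:2182 - Replica(id='2b1epayu', deployment='query-events-deployment', app='default') is stopped.
"""


def replica_timeline(log_text):
    """Return {replica_id: [(timestamp, message), ...]} for controller lines."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        # Only the ServeController reports replica lifecycle changes here.
        if "ServeController" not in line:
            continue
        match = REPLICA_EVENT.search(line)
        if match:
            events[match.group("rid")].append(
                (match.group("ts"), match.group("msg").strip())
            )
    return dict(events)


if __name__ == "__main__":
    for replica_id, replica_events in replica_timeline(SAMPLE).items():
        print(replica_id)
        for timestamp, message in replica_events:
            print("  ", timestamp, message)
```

Run against the full output, this shows per replica when it was force-killed and when it was reported as stopped.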

When I run it locally with a plain Python command such as python main.py, it runs fine. Can you please help ASAP, as it's blocking my production deployment?

It would be great if I could get some time over a call, as I am trying to set this up on EKS but am stuck with similar issues. There are not enough logs to help me here.

Hi @GuptaShitul, this is not really enough information to help debug. Could you provide the full controller and replica logs?
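
If it helps, the Serve logs can be bundled into a single archive for sharing. This is a minimal sketch, not an official tool: it assumes the default Ray log location /tmp/ray/session_latest/logs/serve (Linux/macOS). On Windows, or with a custom temp dir, pass the actual session log directory instead. bundle_serve_logs is a hypothetical helper name, not a Ray API.

```python
import zipfile
from pathlib import Path

# Assumption: default location of Ray Serve logs on Linux/macOS.
# On Windows or with a custom temp dir, pass the real directory explicitly.
DEFAULT_SERVE_LOG_DIR = Path("/tmp/ray/session_latest/logs/serve")


def bundle_serve_logs(log_dir=DEFAULT_SERVE_LOG_DIR, out_path="serve_logs.zip"):
    """Zip every file under log_dir and return the archived relative paths."""
    log_dir = Path(log_dir)
    # Collect the file list before creating the archive, so the archive
    # itself is never picked up if out_path lives inside log_dir.
    files = sorted(p for p in log_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=str(path.relative_to(log_dir)))
    return [str(path.relative_to(log_dir)) for path in files]


if __name__ == "__main__":
    for name in bundle_serve_logs():
        print("added", name)
```

The resulting serve_logs.zip can then be attached here; check it for credentials or other secrets before sharing.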

I would also suggest looking for help in the #serve channel of the Ray Slack.