Ray cluster shutting down while FastAPI app stays alive

I’ve got a service built with FastAPI that connects to a Ray cluster. My issue is that when the Ray cluster crashes, for some reason the FastAPI application stays alive, leaving the whole service unresponsive. I need a way to stop the whole application and let the pod crash. At the moment I don’t even get notified of the situation unless I look at the logs.

Some more details:

I have one pod starting a Ray cluster with

ray start --head

In the same pod I then start a FastAPI app, which I serve with gunicorn using uvicorn workers.
The app creates one Ray actor, which is essentially a pika client that consumes from a specific queue and does heavy computation (it is a preprocessing/training ML application). What happens is that at some point the Ray cluster stops; I’m not sure why and I’ve been trying to find out from the logs, with no luck yet. My big issue is not that the Ray cluster stops — it might just go out of memory, and that’s fine. The problem is that when the Ray cluster stops, the gunicorn-served application does not stop: it keeps restarting its workers, which will fail forever since the Ray cluster instance is down. I’d need a way to stop the gunicorn app when the Ray client gets disconnected, but I cannot find one.

import ray
from fastapi import FastAPI
from domain.dummy_worker import DummyWorker

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": f"World {ray.get(dummy2.gimme_a_number.remote())}"}

@app.on_event("startup")
def main():
    ray.init(address="auto", dashboard_host="127.0.0.1",
             dashboard_port=8260,
             _memory=78643200)

    Dummy = ray.remote(DummyWorker)

    dummy = Dummy.remote()
    dummy.run.remote()

    global dummy2
    dummy2 = Dummy.remote()

if __name__ == "__main__":
    main()

In order to run it I have an sh script, used as the CMD entry point of my Docker container, which essentially does:

ray start --head

sh start.sh

and start.sh is essentially the one available in tiangolo’s Docker image, which configures gunicorn and at the end just does:

gunicorn -k blabla.UvicornWorker "$APPLICATION_NAME"

Now, when Ray crashes for some reason, the application becomes unresponsive.

So, if there were a way to detect the Ray failure, would that be sufficient to solve your problem?

E.g., you can have a background task that checks the health of Ray, and when it detects a Ray failure, it kills the FastAPI process.
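A minimal sketch of such a watchdog, assuming you run it in a daemon thread started from the FastAPI startup hook. The `check` callable is an assumption: with Ray you might implement it as something like `lambda: ray.get(ping_actor.ping.remote(), timeout=5)` on a trivial ping actor, so that it raises when the cluster is unreachable. Signaling the right process under gunicorn is also an assumption you’d need to verify for your setup:

```python
import os
import signal
import threading
import time

def watch(check, on_dead, interval=10.0, max_failures=3):
    """Run `check()` every `interval` seconds; after `max_failures`
    consecutive failures, call `on_dead()` once and stop."""
    failures = 0
    while failures < max_failures:
        time.sleep(interval)
        try:
            check()        # should raise if the Ray cluster is unreachable
            failures = 0   # healthy again: reset the failure counter
        except Exception:
            failures += 1
    on_dead()

def kill_self():
    # SIGTERM the current process; under gunicorn you may instead need to
    # signal the master (os.getppid()), otherwise the worker is respawned
    os.kill(os.getpid(), signal.SIGTERM)

def start_watchdog(check):
    # daemon=True so the thread never blocks interpreter shutdown;
    # call this from the startup hook, after ray.init()
    t = threading.Thread(target=watch, args=(check, kill_self), daemon=True)
    t.start()
    return t
```

Once the process exits, the pod’s normal restart policy can bring everything (Ray head included) back up; the same `check` could also back a liveness-probe endpoint instead of killing the process directly.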