Ray cluster shutting down while FastAPI app stays alive

I’ve got a service built with FastAPI that connects to a Ray cluster. My issue is that when the Ray cluster crashes, for some reason the FastAPI application stays alive, leaving the whole service unresponsive. I need a way to stop the whole application and let the pod crash. At the moment I don’t even get notified of the situation unless I look at the logs.

Some more details:

I have one pod starting a Ray cluster with

ray start --head

In the same pod I then start a FastAPI app, which I serve with gunicorn using uvicorn workers.
The app creates one Ray actor, which is essentially a pika client that consumes from a specific queue and does heavy computation (it is a preprocessing/training ML application). What happens is that at some point the Ray cluster stops; I’m not sure why and I’ve been trying to find out from the logs, with no luck yet. My big issue is not that the Ray cluster stops — it might just go out of memory, and that’s fine. The problem is that when the Ray cluster stops, the gunicorn-served application does not stop: it keeps restarting its workers, which will fail forever since the Ray cluster instance is down. I’d need a way to stop the gunicorn app when the Ray client gets disconnected, but I cannot find one.

import ray
from fastapi import FastAPI
from domain.dummy_worker import DummyWorker

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": f"World {ray.get(dummy2.gimme_a_number.remote())}"}

@app.on_event("startup")
def main():
    ray.init(address="auto", dashboard_host="127.0.0.1",
             dashboard_port=8260,
             _memory=78643200)

    Dummy = ray.remote(DummyWorker)

    dummy = Dummy.remote()
    dummy.run.remote()

    global dummy2
    dummy2 = Dummy.remote()

if __name__ == "__main__":
    main()

In order to run it I have an sh script, used as the CMD entry point of my Docker container, which essentially does:

ray start --head

sh start.sh

and start.sh is essentially the one available in tiangolo’s Docker image, which configures gunicorn and at the end just does:

gunicorn -k blabla.UvicornWorker "$APPLICATION_NAME"

Now, when Ray crashes for some reason, the application becomes unresponsive.

So, if there were a way to detect the Ray failure, would that be sufficient to solve your problem?

E.g., you can have a background task that checks the health of Ray, and when it detects a Ray failure, it kills the FastAPI process.
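A minimal sketch of such a watchdog, assuming you run it in a daemon thread started from the FastAPI startup hook. The `check` callable is an assumption: with Ray you might implement it as something like `lambda: ray.get(ping_actor.ping.remote(), timeout=5)` on a trivial ping actor, so that it raises when the cluster is unreachable. Signaling the right process under gunicorn is also an assumption you’d need to verify for your setup:

```python
import os
import signal
import threading
import time

def watch(check, on_dead, interval=10.0, max_failures=3):
    """Run `check()` every `interval` seconds; after `max_failures`
    consecutive failures, call `on_dead()` once and stop."""
    failures = 0
    while failures < max_failures:
        time.sleep(interval)
        try:
            check()        # should raise if the Ray cluster is unreachable
            failures = 0   # healthy again: reset the failure counter
        except Exception:
            failures += 1
    on_dead()

def kill_self():
    # SIGTERM the current process; under gunicorn you may instead need to
    # signal the master (os.getppid()), otherwise the worker is respawned
    os.kill(os.getpid(), signal.SIGTERM)

def start_watchdog(check):
    # daemon=True so the thread never blocks interpreter shutdown;
    # call this from the startup hook, after ray.init()
    t = threading.Thread(target=watch, args=(check, kill_self), daemon=True)
    t.start()
    return t
```

Once the process exits, the pod’s normal restart policy can bring everything (Ray head included) back up; the same `check` could also back a liveness-probe endpoint instead of killing the process directly.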