I am running Ray Serve on an autoscaling Ray cluster deployed on Kubernetes. I am trying to debug an issue where the deployment suddenly gets blocked after successfully receiving requests for some time.
Here is a simplified snippet of my setup:
import asyncio
import logging
import uuid

from fastapi import FastAPI
from ray import serve

logger = logging.getLogger("ray.serve")


@serve.deployment(
    max_concurrent_queries=2,
    ray_actor_options={
        "max_concurrency": 1,
        "memory": 30 * 1024 * 1024 * 1024,
    },
    _autoscaling_config={
        "target_num_ongoing_requests_per_replica": 1,
        "max_replicas": 10,
        "min_replicas": 3,
    },
    version="0.1",
)
class Model:
    def predict(self):
        # preprocessing, inference and post-processing
        return "y"


app = FastAPI()
results = {}  # in-memory (per-replica) store of completed predictions


@serve.deployment(route_prefix="/")
@serve.ingress(app)
class API:
    def __init__(self):
        self._debug = True

    @app.post("/")
    async def predict(self):
        logger.info("request start")
        task_id = str(uuid.uuid4())
        handle = Model.get_handle(sync=False)
        coro = handle.predict.remote()

        async def wait(coro, task_id):
            # resolve the async handle call to an ObjectRef, then await the result
            ref = await coro
            pred = await ref
            results[task_id] = pred

        async def debug():
            # heartbeat to confirm the replica's event loop keeps making progress
            while True:
                logger.info("running")
                await asyncio.sleep(10)

        loop = asyncio.get_running_loop()
        loop.set_debug(True)
        if self._debug:
            loop.create_task(debug())
            self._debug = False
        loop.create_task(wait(coro, task_id))
        return task_id

    @app.get("/{task_id}")
    async def get(self, task_id: str):
        return results[task_id]
The Model deployment's computation is quite expensive, taking about a minute, so the predict method in the API returns an id to the caller immediately and waits for the result from the model in a background task.
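For context, a client interacts with the API roughly like this (a minimal sketch, not my actual client; the host/port and polling interval are placeholders, and it assumes the get endpoint returns a non-2xx response while the result is not yet in results):

import time
import requests

BASE_URL = "http://localhost:8000"  # placeholder for the Serve HTTP address

# submit a prediction; the API responds with a task id right away
task_id = requests.post(f"{BASE_URL}/").json()

# poll the get endpoint until the background task has stored the result
while True:
    resp = requests.get(f"{BASE_URL}/{task_id}")
    if resp.ok:
        print("prediction:", resp.json())
        break
    time.sleep(2)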
This works well initially. But when I subject the API to a sustained load of about 3 predict requests per minute, it stops receiving any requests after about 30-60 minutes. Further requests to both the predict and get endpoints of the API hang.
I have verified that the event loop in the API actor is not blocked when this happens: the debug coroutine keeps running. If I redeploy the API deployment at this point, it starts receiving requests again, so the problem is definitely somewhere within Ray. I have looked at the logs of the HttpProxyActor and other Ray core worker logs, but did not find anything relevant. How can I troubleshoot this further?
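For example, I could extend the debug coroutine to also dump the tasks still pending on the loop, to see whether wait() coroutines pile up because the handle calls never resolve (a rough sketch using only the standard library, not part of my current deployment; it assumes Python 3.8+ for Task.get_coro()):

async def debug():
    while True:
        # heartbeat plus a dump of every task still pending on this loop;
        # if wait() coroutines keep accumulating here, the Model handle
        # calls issued by this replica are never resolving
        pending = [t for t in asyncio.all_tasks() if not t.done()]
        logger.info("running, %d pending tasks", len(pending))
        for task in pending:
            logger.info("pending task: %r", task.get_coro())
        await asyncio.sleep(10)

Would that be a sensible next step, or is there a better place to look (proxy, controller, or replica logs)?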
I have further observed in my trials that if autoscaling of the Model deployment is turned off, or if I limit the autoscaling by setting max_replicas much lower (to 4), this problem does not occur.
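For reference, the non-autoscaling variant that did not hang simply pins the replica count instead of passing _autoscaling_config (shown only to illustrate the difference; all other options are unchanged):

@serve.deployment(
    max_concurrent_queries=2,
    ray_actor_options={
        "max_concurrency": 1,
        "memory": 30 * 1024 * 1024 * 1024,
    },
    num_replicas=3,  # fixed replica count, no _autoscaling_config
    version="0.1",
)
class Model:
    ...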