Handle "CUDA out of memory" exception on a Ray Serve replica

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m serving an AI model on a small Ray cluster that receives images of arbitrary size. For speed, it should try to run inference on the GPU, but if the image is too large it has to fall back to the CPU. Usually I catch a “CUDA out of memory” exception with a simple try/except in Python, but that doesn’t seem to work in a Ray Serve replica serving HTTP requests. Once the CUDA OOM exception is thrown, execution doesn’t continue; the task stops with the following output:

replica.py:510 - HANDLE __call__ ERROR 2384.0ms

future: <Task finished coro=<_wrap_awaitable() done, defined at C:\Python37\lib\asyncio\tasks.py:623> exception=RayTaskError(RuntimeError)(RuntimeError...
RuntimeError: CUDA out of memory. Tried to allocate 116.00 MiB (GPU
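
For reference, here is a minimal sketch of the GPU-first, CPU-fallback pattern described above. The names `infer_gpu`, `infer_cpu`, `is_cuda_oom`, and `infer_with_fallback` are hypothetical placeholders, not part of Ray or PyTorch, and matching on the text "out of memory" is an assumption based on the error message in the log. It only illustrates the intended try/except logic; it does not by itself explain why the exception escapes inside the replica.

```python
def is_cuda_oom(exc: BaseException) -> bool:
    # Assumption: the OOM condition surfaces as a RuntimeError whose message
    # contains "out of memory", as in the log output above.
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def infer_with_fallback(image, infer_gpu, infer_cpu):
    """Try the GPU path first; on a CUDA OOM error, retry on the CPU.

    `infer_gpu` and `infer_cpu` are hypothetical callables that each take
    an image and return a prediction.
    """
    try:
        return infer_gpu(image)
    except RuntimeError as exc:
        if not is_cuda_oom(exc):
            # Unrelated runtime errors should still propagate.
            raise
        # In a real PyTorch setup one might free cached GPU memory here
        # before falling back.
        return infer_cpu(image)
```

The try/except has to wrap the point where the exception is actually raised or awaited. If the GPU work runs in a separate task or is awaited elsewhere, a try/except around only the submission would not see it.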