How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hi,
I’m serving an AI model on a small Ray cluster that receives images of arbitrary size. For speed, it should try to run inference on the GPU, but if the image is too large it has to fall back to the CPU. Usually I catch the “CUDA out of memory” exception with a simple try/except in Python, but that doesn’t seem to work in a Ray Serve replica serving HTTP requests. Once the CUDA OOM exception is thrown, the replica doesn’t continue; it stops the task with the following output:
replica.py:510 - HANDLE __call__ ERROR 2384.0ms
future: <Task finished coro=<_wrap_awaitable() done, defined at C:\Python37\lib\asyncio\tasks.py:623> exception=RayTaskError(RuntimeError)(RuntimeError...
RuntimeError: CUDA out of memory. Tried to allocate 116.00 MiB (GPU
Hi @bananajoe182, sorry you’re running into this. Could you share more details about the try/except code? What does it look like, and where are you calling it in the case where it works and in the case where it doesn’t? It’s interesting that it works for you in ordinary Python but not in a Ray Serve replica.
Hey @architkulkarni,
This code handles the runtime exception ‘CUDA out of memory…’ as it should (continuing on the CPU) when run locally, but somehow aborts the whole process when the exception is thrown in a Serve replica.
import numpy as np
import ray
import utils
import torch
from starlette.requests import Request
from ray import serve
from ray.exceptions import RayTaskError

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")


@serve.deployment(route_prefix="/predict")
class Lama:
    def __init__(self):
        model_path = 'big-lama.pt'
        self.model = torch.jit.load(model_path)
        self.model.eval()

    async def __call__(self, http_request: Request) -> bytes:
        form = await http_request.form()
        image_stream: bytes = await form['file'].read()
        info_stream: bytes = await form['info'].read()
        img = np.fromstring(image_stream, dtype='<f4')
        info = np.fromstring(info_stream, count=3, dtype=np.intc)
        width, height, channels = info[0], info[1], info[2]
        img = np.reshape(img, (channels, height, width))
        img = np.transpose(img, (1, 2, 0))
        img = np.flipud(img)
        rgb = img[:, :, :3].transpose(2, 0, 1)
        mask = img[np.newaxis, :, :, 3]
        rgb = utils.pad_img_to_modulo(rgb, 8)
        mask = utils.pad_img_to_modulo(mask, 8)
        rgb_tensor = torch.from_numpy(rgb).to(device).unsqueeze(0)
        mask_tensor = torch.from_numpy(mask).to(device).unsqueeze(0)
        mask_tensor = (mask_tensor > 0) * 1
        # DO image processing here
        result = self.infer(rgb_tensor, mask_tensor)
        img = result[0, ::, :height, :width].permute(1, 2, 0).detach().cpu().numpy()
        img = np.flipud(img)
        img = np.transpose(img, (2, 0, 1))
        output_file = img.tobytes()
        rgb_tensor = None
        mask_tensor = None
        result = None
        self.model.to(device)
        torch.cuda.empty_cache()
        return output_file

    def infer(self, rgb_tensor, mask_tensor):
        try:
            result = self.model(rgb_tensor, mask_tensor)
        except RuntimeError as e:
            if str(e).startswith('CUDA out of memory.'):
                print('Using cpu...')
                self.model.to(torch.device('cpu'))
                rgb_tensor = rgb_tensor.to(torch.device('cpu'))
                mask_tensor = mask_tensor.to(torch.device('cpu'))
                result = self.model(rgb_tensor, mask_tensor)
                print('Done')
            else:
                raise e
        return result


lama = Lama.bind()
Thanks for sharing the error, that does seem pretty bizarre. You’re catching the exception within the infer function, so I don’t see how an exception could be raised. Is there more traceback that could show us where exactly the exception is being raised from? (Perhaps somewhere in the actor logs?)
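One way to get that fuller traceback might be to wrap the handler and record the whole stack before re-raising. This is only a sketch in plain Python: `call_with_traceback`, `failing_handler` and the `captured` list are hypothetical names, and a real replica would more likely write to its logger than to a list. It could also help settle whether the error comes from inside infer or from somewhere else in __call__, for example the `.to(device)` calls, which sit outside the try block.

```python
import asyncio
import traceback

# Hypothetical in-memory log; a real replica would use the logging module.
captured = []


async def call_with_traceback(handler, request):
    """Await handler(request); record the full traceback on failure, then re-raise."""
    try:
        return await handler(request)
    except Exception:
        # format_exc() returns the complete stack, including the raising frame.
        captured.append(traceback.format_exc())
        raise


async def failing_handler(request):
    # Simulated error so the sketch runs without a GPU.
    raise RuntimeError("CUDA out of memory. (simulated)")
```

The recorded text names the function and line that raised, which is the part missing from the truncated output above.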
I also wonder if you’re able to reproduce this with a very minimal example that takes CUDA out of the picture (e.g. a Serve replica that does nothing except manually raise an exception inside __call__ and catch it with try/except).
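A minimal sketch of such a reproduction might look like the following. The class name `FakeOOM` and the simulated error message are made up for illustration, and the class is plain Python so it runs under `asyncio` without a cluster. Decorating it with `@serve.deployment` and deploying it the same way as `Lama` is what would turn it into a Serve replica.

```python
import asyncio


class FakeOOM:
    """Hypothetical stand-in for the Lama deployment, with no CUDA involved."""

    def infer(self, use_gpu: bool) -> str:
        if use_gpu:
            # Simulate the start of the message the original code matches on.
            raise RuntimeError("CUDA out of memory. (simulated)")
        return "cpu result"

    async def __call__(self, request=None) -> str:
        try:
            return self.infer(use_gpu=True)
        except RuntimeError as e:
            if not str(e).startswith("CUDA out of memory."):
                raise
            print("Using cpu...")
            # Fall back to the path that does not raise.
            return self.infer(use_gpu=False)


if __name__ == "__main__":
    print(asyncio.run(FakeOOM()(None)))
```

If this falls back correctly as plain Python but not once it is deployed, that would point at Serve rather than CUDA.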