Model creation error handling

Ray: 1.4

When creating and deploying a model such as

SomeRayImageModel.options(init_args=model_args, num_replicas=NUM_REPS, name=model_id).deploy()

what is the “best practice” or suggested approach for error handling? For example, let's say that “SomeRayImageModel” initializes the model as part of its creation and an exception is raised (e.g., using PyTorch and the model download fails).

In my specific case, I see the exception thrown in __init__, and the Ray Serve framework reports:

2021-07-22 09:09:53.566 | ERROR    | app.inference.pytorch.base_image_model:__init__:20 - OOPS...exception during model creation <urlopen error [Errno 8] nodename nor servname provided, or not known>
(pid=13782) 2021-07-22 09:09:53,571	ERROR -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::HrTJVO:SERVE_CONTROLLER_ACTOR:wide_resnet50_2#QZFjIF:RayServeWrappedReplica.__init__ (pid=13782, ip=
(pid=13782)   File "/Users/developer/.pyenv/versions/3.8.6/lib/python3.8/http/", line 1255, in request
(pid=13782)     self._send_request(method, url, body, headers, encode_chunked)

and then it seems to get into a loop where it keeps trying to create/deploy the same model.

Any suggestions regarding error handling, Ray Serve, and creating/deploying models?



For failures in the deployment constructor, what we currently have is not the best experience, and we’re actively working on improving it in “[serve] Better behavior when deployment constructor fails” (Issue #16114, ray-project/ray on GitHub).

So far the proposed behavior is:

  1. Surface error in log and don’t loop forever
  2. deploy() can return “success” if we have at least 1 replica running at desired model version
  3. If all replicas have failed after 3 retries, we consider this deploy() call failed, return, and terminate.
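
To make the three points above concrete, here is a rough sketch in plain Python. It is not Ray Serve code and does not reflect how Serve implements this; `DeployOutcome`, `start_replica`, and `simulate_deploy` are hypothetical names made up for illustration, and it assumes "3 retries" means at most 3 construction attempts per replica.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 3  # assumption: "3 retries" read as 3 total attempts per replica


@dataclass
class DeployOutcome:
    """Hypothetical result type; not part of Ray Serve."""
    success: bool
    running_replicas: int
    errors: List[str] = field(default_factory=list)


def start_replica(constructor: Callable[[], object], errors: List[str]) -> bool:
    """Try to construct one replica, giving up after MAX_ATTEMPTS failures."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            constructor()
            return True
        except Exception as exc:
            # Point 1: surface the error instead of retrying forever.
            errors.append(f"attempt {attempt}: {exc!r}")
    return False


def simulate_deploy(constructor: Callable[[], object], num_replicas: int) -> DeployOutcome:
    """Start num_replicas replicas and summarize the result."""
    errors: List[str] = []
    running = sum(start_replica(constructor, errors) for _ in range(num_replicas))
    # Point 2: success if at least one replica is running.
    # Point 3: failure if every replica exhausted its attempts.
    return DeployOutcome(success=running >= 1, running_replicas=running, errors=errors)
```

With a constructor that always raises (for example, a failed model download), `simulate_deploy` returns after a bounded number of attempts with `success=False` and the collected error messages, rather than looping.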

Do you think that matches your expected behavior? Input is welcome, as this is still WIP 😉


Thanks for the reply.

The proposed behavior seems fine. I like the return of a “success” and would also like to have some sort of “error” object returned for the failure case(s). Alternatively, if async behavior is desired, it would help to be able to pass in a callable that is invoked for both success and failure cases, so that we have a way to deal with those cases programmatically in the application.
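
To illustrate what I mean, here is a minimal sketch of the result object plus callbacks idea. None of this is an existing Ray Serve API: `DeployResult` and `deploy_with_callbacks` are hypothetical names, and the sketch assumes the wrapped deploy call raises an exception on failure instead of retrying indefinitely.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DeployResult:
    """Hypothetical result object; Ray Serve does not define this type."""
    ok: bool
    error: Optional[BaseException] = None


def deploy_with_callbacks(
    deploy_fn: Callable[[], None],
    on_success: Callable[[DeployResult], None],
    on_failure: Callable[[DeployResult], None],
) -> DeployResult:
    """Run deploy_fn and report the outcome via a result object and a callback."""
    try:
        deploy_fn()
    except Exception as exc:
        # Failure case: hand the application the exception to inspect.
        result = DeployResult(ok=False, error=exc)
        on_failure(result)
    else:
        # Success case: no error attached.
        result = DeployResult(ok=True)
        on_success(result)
    return result
```

In the application, `deploy_fn` could be a lambda wrapping the `.options(...).deploy()` call from my original post, with `on_failure` logging the error or marking the model as unavailable.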