Model creation error handling

puntime_error · July 22, 2021, 1:16pm

Ray: 1.4

When creating and deploying a model such as

SomeRayImageModel.options(init_args=model_args, num_replicas=NUM_REPS, name=model_id).deploy()

what is a “best practice” or suggestion for error handling? For example, let's say that “SomeRayImageModel” initializes the model as part of its creation and an exception is raised (e.g., when using PyTorch, the model download fails).
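
To make the scenario concrete, here is a minimal sketch of a constructor that fails during the model download. It is plain Python with no Ray involved; `SomeImageModel`, `ModelInitError`, `download_fn`, and `failing_download` are hypothetical names used only for illustration, not part of Ray Serve or PyTorch.

```python
import urllib.error


class ModelInitError(RuntimeError):
    """Hypothetical error type: raised when a model cannot be initialized."""


class SomeImageModel:
    """Plain-Python stand-in for the deployment class (no Ray required)."""

    def __init__(self, model_name, download_fn):
        # download_fn is a hypothetical callable that fetches the weights.
        try:
            self.model = download_fn(model_name)
        except OSError as exc:
            # urllib.error.URLError is a subclass of OSError, so network
            # failures land here. Re-raise with context instead of
            # swallowing the error, and keep the original as __cause__.
            raise ModelInitError(
                f"could not download weights for {model_name!r}: {exc}"
            ) from exc


def failing_download(model_name):
    # Simulates a network failure such as a failed DNS lookup.
    raise urllib.error.URLError("nodename nor servname provided, or not known")
```

Calling `SomeImageModel("wide_resnet50_2", failing_download)` raises `ModelInitError` with the original `URLError` attached as `__cause__`, so whatever creates the object sees one clearly named failure.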

In my specific case, I see the exception is thrown in the init, and the Ray Serve framework reports

2021-07-22 09:09:53.566 | ERROR    | app.inference.pytorch.base_image_model:__init__:20 - OOPS...exception during model creation <urlopen error [Errno 8] nodename nor servname provided, or not known>
(pid=13782) 2021-07-22 09:09:53,571	ERROR worker.py:418 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::HrTJVO:SERVE_CONTROLLER_ACTOR:wide_resnet50_2#QZFjIF:RayServeWrappedReplica.__init__ (pid=13782, ip=127.0.0.1)
(pid=13782)   File "/Users/developer/.pyenv/versions/3.8.6/lib/python3.8/http/client.py", line 1255, in request
(pid=13782)     self._send_request(method, url, body, headers, encode_chunked)

and then it seems to get stuck in a loop, repeatedly trying to create and deploy the same model.

Any suggestions regarding error handling, Ray Serve, and creating/deploying models?

Thanks

jiaodong · August 5, 2021, 9:14pm

Hi,

For failures in the deployment constructor, what we currently have is not the best experience, and we’re actively working on improving it in [serve] Better behavior when deployment constructor fails · Issue #16114 · ray-project/ray · GitHub.

So far the proposed behavior is:

Surface the error in the log and don’t loop forever.
deploy() can return “success” if at least 1 replica is running at the desired model version.
If all replicas have failed after 3 retries, we consider this deploy() call failed, and it returns and terminates.
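
The proposed behavior above can be sketched as a small simulation. This is not Ray Serve's implementation; `simulate_deploy`, `DeployResult`, and `make_replica` are hypothetical names, and treating "3 retries" as three attempts per replica is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_RETRIES = 3  # taken from the proposal; read here as attempts per replica


@dataclass
class DeployResult:
    success: bool                 # True if at least one replica is running
    running: int                  # number of replicas that started
    errors: List[Exception] = field(default_factory=list)


def simulate_deploy(make_replica: Callable[[], object],
                    num_replicas: int,
                    max_retries: int = MAX_RETRIES) -> DeployResult:
    """Try to start each replica a bounded number of times (hypothetical)."""
    running = 0
    errors: List[Exception] = []
    for _ in range(num_replicas):
        for _attempt in range(max_retries):
            try:
                make_replica()
            except Exception as exc:
                # Surface the error and stop after max_retries attempts
                # instead of looping forever.
                errors.append(exc)
            else:
                running += 1
                break
    return DeployResult(success=running >= 1, running=running, errors=errors)
```

With a `make_replica` that always raises, the result has `success=False` and carries every collected exception, so the caller can inspect the failures programmatically.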

Do you think that matches your expected behavior? Input is welcome, as this is still a work in progress.

Jiao

puntime_error · September 17, 2021, 3:56pm

Thanks for the reply.

The proposed behavior seems fine. I like the return of a “success” and would also like to have some sort of “error” object returned for the failure case(s). Or, if async behavior is desired, the ability to pass in a callable that is invoked for both the success and failure cases, so that we have a way to deal with those cases programmatically in the application.
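
As a rough illustration of the callable idea, something like the following wrapper would cover both cases. `deploy_with_callbacks`, `on_success`, and `on_failure` are hypothetical names, not an existing Ray Serve API, and the sketch assumes the wrapped deploy call raises an exception on failure.

```python
from typing import Any, Callable


def deploy_with_callbacks(deploy_fn: Callable[[], Any],
                          on_success: Callable[[Any], None],
                          on_failure: Callable[[Exception], None]) -> bool:
    """Run deploy_fn and route the outcome to one of two callbacks.

    Hypothetical helper: assumes deploy_fn raises when deployment fails.
    """
    try:
        result = deploy_fn()
    except Exception as exc:
        on_failure(exc)   # the application decides how to react
        return False
    on_success(result)
    return True
```

Here `deploy_fn` could be a zero-argument function wrapping the `.options(...).deploy()` call from the original question, provided that call raises rather than retrying indefinitely.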
