Deployment's init function takes too long to load model


I’m loading a large language model in the init function of my Ray Serve deployment. The model takes some time to load, and Ray constantly tears down and recreates the deployment with messages such as these:

Recovering target state for deployment Generate from checkpoint…
(ServeController pid=2866163) INFO 2023-04-08 11:17:07,369 controller 2866163 - Adding 1 replica to deployment ‘Generate’.

How do I increase the time allowed to load a model in the init function of a deployment?

Hi @ankur_ankur! Glad to hear that you are exploring Ray Serve!
From the log, it looks like the Serve controller died and was restarted. Can you double-check the node resources while the model loads? If you look in /tmp/ray/session_latest/logs/serve/, you should see multiple controller logs. Can you check all of them for failures?
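
If it helps, here is a rough sketch of how one might scan every file in that log directory for suspicious lines instead of opening them one by one. It uses only the Python standard library. The function name `find_failures` and the marker strings are assumptions made for illustration, not an official list of Ray error messages, so adjust them to whatever actually shows up in your logs.

```python
from pathlib import Path

# Substrings treated as signs of trouble. This list is an assumption for
# illustration, not an official vocabulary of Ray Serve error messages.
FAILURE_MARKERS = ("ERROR", "Traceback", "Killed", "OutOfMemory")


def find_failures(log_dir, markers=FAILURE_MARKERS):
    """Return (file name, line number, line) for each log line containing a marker."""
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        # errors="replace" keeps the scan going if a log has odd bytes in it
        with path.open(errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Calling `find_failures("/tmp/ray/session_latest/logs/serve/")` and printing each tuple gives a quick overview of which controller or replica log is worth reading in full.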

@Sihan_Wang By the way, @ankur_ankur is working on an example of LLM serving with Ray Serve.

Thanks to you both! I resolved the issue. It was unrelated to the time taken to finish the init function and was indeed related to node resources. The model just needed more RAM to load successfully.
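
For anyone who hits the same thing, a pre-flight check along these lines might catch the problem before the init function starts loading the model. This is only a sketch, assuming a Linux node where /proc/meminfo is available. The function names and the `required_bytes` estimate are hypothetical, and you would need to supply your own figure for how much RAM the model needs.

```python
def available_memory_bytes(meminfo_text):
    """Parse the MemAvailable field from /proc/meminfo content (Linux only)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:    8388608 kB"
            return int(line.split()[1]) * 1024
    raise ValueError("MemAvailable not found in meminfo text")


def has_headroom(meminfo_text, required_bytes):
    """Return True if available memory covers the caller's own estimate."""
    return available_memory_bytes(meminfo_text) >= required_bytes
```

In the init function, one could read the file with `open("/proc/meminfo").read()`, pass the text to `has_headroom` together with the estimate, and raise a clear error before attempting the load if it returns False.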

@ankur_ankur Did the logs tell you anything about memory starvation? Knowing that would help us improve the logging or surface a more informative error message.