Deployment's init function takes too long to load model

ankur_ankur · April 8, 2023, 3:23pm

Hello,

I’m loading a large language model in the init function of my Ray Serve deployment. The model takes some time to load, and Ray keeps tearing down and recreating the deployment with messages such as these:

Recovering target state for deployment Generate from checkpoint…
(ServeController pid=2866163) INFO 2023-04-08 11:17:07,369 controller 2866163 deployment_state.py:1310 - Adding 1 replica to deployment 'Generate'.

How do I increase the time allowed to load a model in the init function of a deployment?

Sihan_Wang · April 11, 2023, 11:02pm

Hi @ankur_ankur! Glad to hear that you are exploring Ray Serve!
From the log, it looks like the Serve controller died and was restarted. Can you double-check the node's resources while the model loads? If you look in /tmp/ray/session_latest/logs/serve/, you should see multiple controller logs. Can you check all of them for failures?
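A minimal sketch of that log check, using only the Python standard library. The keyword list and the function name `scan_serve_logs` are assumptions for illustration, not part of Ray; the sketch reads every regular file in the directory rather than relying on a particular log file naming scheme.

```python
from pathlib import Path

# Assumed markers of a crash or memory problem; adjust to what your logs contain.
KEYWORDS = ("ERROR", "Traceback", "OutOfMemory", "Killed")


def scan_serve_logs(log_dir, keywords=KEYWORDS):
    """Return (file name, line number, line) for each log line containing a keyword."""
    root = Path(log_dir)
    if not root.is_dir():
        return []
    hits = []
    for path in sorted(root.iterdir()):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a log has undecodable bytes.
        with path.open(errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if any(k in line for k in keywords):
                    hits.append((path.name, lineno, line.rstrip()))
    return hits


if __name__ == "__main__":
    for name, lineno, line in scan_serve_logs("/tmp/ray/session_latest/logs/serve/"):
        print(f"{name}:{lineno}: {line}")
```

Run it on the node that hosts the controller. An empty result only means none of the assumed keywords appeared, so it is still worth reading the logs around the restart time by hand.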

Jules_Damji · April 11, 2023, 11:07pm

@Sihan_Wang btw @ankur_ankur is working on an example of LLM serving with Ray Serve.

ankur_ankur · April 12, 2023, 2:37am

Thanks to you both! I resolved the issue. It was unrelated to the time taken by the init function and was indeed related to node resources. The model just needed more RAM to load successfully.
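For anyone hitting the same symptom, here is a rough sketch of a pre-load memory check that could be called at the top of the init function. It assumes a Linux node where /proc/meminfo is readable; the helper names and the required size are hypothetical, and memory that is free on the node is not necessarily what the replica process will be able to use.

```python
def parse_mem_available(meminfo_text):
    """Return MemAvailable in bytes from /proc/meminfo-style text, or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   12345678 kB"
            return int(line.split()[1]) * 1024
    return None


def enough_memory(required_bytes, meminfo_path="/proc/meminfo"):
    """Return True or False when it can be determined, None when it cannot."""
    try:
        with open(meminfo_path) as f:
            available = parse_mem_available(f.read())
    except OSError:
        # The file is missing or unreadable, for example on a non-Linux system.
        return None
    if available is None:
        return None
    return available >= required_bytes


if __name__ == "__main__":
    # Hypothetical figure: replace with your own estimate of the model's footprint.
    required = 16 * 1024**3
    print("Enough memory to load the model:", enough_memory(required))
```

Logging the result before loading makes a memory shortfall visible in the replica's own log instead of showing up only as repeated replica restarts.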

Jules_Damji · April 12, 2023, 3:01am

@ankur_ankur Did the log tell you anything about memory starvation? If not, we can improve the logging or surface a more informative error message.
