How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I am trying to test serving an LLM on a local machine using FastAPI, and the model takes a while to load: the deployment's __init__()
method runs for more than 30 seconds, the deployment appears to go into a restart loop, and I get this message in the logs:
“Deployment ‘my_deployment’ in application ‘my_app’ has 1 replicas that have taken more than 30s to initialize. This may be caused by a slow init or reconfigure method.”
The message is correct: the init is slow, and there is nothing I can do about that at this stage. Is there a way to remove the timeout on the init method, or to set it to a longer value?
Thanks
Hi @nelsonrogers, thanks for posting. That message is a warning, not an error. The deployment replica does not restart every 30s; it continues to initialize, and the warning is emitted every 30s to signal that the __init__
method is still running, just slowly. Does the replica eventually start?
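In the meantime, you can watch the replica's state while the constructor runs. A minimal sketch, assuming a recent Ray 2.x where serve.status() is available (the CLI equivalent is serve status):

import time

from ray import serve

# Poll the Serve controller and print the application/deployment
# statuses it reports; repeated replica restarts will show up here
# even when the replica's own log ends abruptly.
for _ in range(20):
    print(serve.status())
    time.sleep(10)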
Ok, thanks for the reply.
Unfortunately, the replica does not start. It cuts out in the middle of loading the model (it loads 2 out of 4 shards of the model, based on the logs) and terminates suddenly without giving any further error messages (or maybe it does print one, but the replica dies before I can see it). It just ends up doing this in a loop until I kill it manually.
Here is my code, in case it helps:
from fastapi import FastAPI, HTTPException
from ray import serve
from pydantic import BaseModel
from transformers import pipeline
import torch
from huggingface_hub import login

app = FastAPI()


# Define Pydantic models for the request body
class Message(BaseModel):
    role: str
    content: str


class Messages(BaseModel):
    messages: list[Message]


# Define the deployment
@serve.deployment(name="MyModel", num_replicas=1)
@serve.ingress(app)
class MyModel:
    def __init__(self):
        # Log in to the HuggingFace Hub
        login(token="my_token", add_to_git_credential=True)

        # Initialize the transformers pipeline
        model_id = "my_model_id"
        self.pipe = pipeline(
            "text-generation",
            model=model_id,
            model_kwargs={"torch_dtype": torch.bfloat16},
            device="cpu",
        )

    @app.post("/")
    async def generate(self, messages: Messages):
        try:
            # Prepare the messages for the model
            chat = [
                {"role": msg.role, "content": msg.content}
                for msg in messages.messages
            ]

            # Generate the response using the pipeline
            outputs = self.pipe(
                chat,
                max_new_tokens=256,
                do_sample=False,
            )
            generated_text = outputs[0]["generated_text"][-1]["content"]

            # Return the generated response
            return {"generated_text": generated_text}
        except Exception as e:
            raise HTTPException(status_code=500, detail=str(e))


# The deployment is created here
deployment = MyModel.bind()
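One thing worth checking, given that the replica dies silently partway through loading the shards: a multi-shard model materialized in bfloat16 on CPU can exhaust node RAM, and when that happens Ray tends to kill the worker process with little trace in the application log (the raylet logs under /tmp/ray/session_latest/logs usually record it). A hedged sketch of requesting explicit memory for the replica; only the decorator changes, and the 16 GiB figure is an assumption to be sized to the model:

# Same class body as above; only the decorator changes.
@serve.deployment(
    name="MyModel",
    num_replicas=1,
    # "memory" is forwarded to the underlying Ray actor, in bytes.
    # 16 GiB here is an assumed figure, not a recommendation.
    ray_actor_options={"memory": 16 * 1024**3},
)
@serve.ingress(app)
class MyModel:
    ...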
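For anyone reproducing this, the app can be launched and smoke-tested like so (the filename my_app.py and the prompt are placeholders):

serve run my_app:deployment

Then, from another shell:

import requests

# Minimal smoke test against Serve's default HTTP address and the
# "/" route defined above.
resp = requests.post(
    "http://127.0.0.1:8000/",
    json={"messages": [{"role": "user", "content": "Hello!"}]},
)
print(resp.json())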