I have implemented a LangChain Ollama service and deployed it with Ray Serve. I set @serve.deployment(num_replicas=3, ray_actor_options={"num_gpus": 0.5}), so it should create 3 instances of the service and occupy 1.5 GPUs in total. Instead, it creates only 1 instance and occupies 0.5 GPU. I don't know why this is happening. I want 3 instances of the LangChain service. Could it be that LangChain does not allow loading the same model multiple times?
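As a sanity check on the numbers in the question, here is a minimal sketch of the GPU arithmetic only. It does not call Ray and is not Ray's scheduler; the helper names `gpus_required` and `max_replicas` and the `available_gpus` value are hypothetical, chosen for illustration.

```python
def gpus_required(num_replicas: int, gpus_per_replica: float) -> float:
    """Total GPU request if every replica is scheduled."""
    return num_replicas * gpus_per_replica


def max_replicas(available_gpus: float, gpus_per_replica: float) -> int:
    """Upper bound on replicas that a given GPU budget could hold.

    Assumption: this only compares totals; real placement also depends
    on how the free GPU capacity is spread across nodes and devices.
    """
    return int(available_gpus // gpus_per_replica)


if __name__ == "__main__":
    # Values from the question: 3 replicas at 0.5 GPU each.
    print("requested GPUs:", gpus_required(3, 0.5))
    # Hypothetical budget: if only 0.5 GPU were free, one replica would fit.
    print("replicas that fit in 0.5 GPU:", max_replicas(0.5, 0.5))
```

If the GPU capacity Ray sees is lower than the requested total, fewer replicas than `num_replicas` can start, which is one possible explanation worth ruling out.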
Hey @Rahil_Doshi, there is probably some kind of configuration issue here. Could you share more about how you're deploying Serve, and could you also post the logs from the Serve controller (/tmp/ray/session_latest/logs/controller_<pid>.log)?
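One way to collect those logs is sketched below, assuming the default log directory mentioned above and that the controller files match `controller_*.log`. The function name `tail_controller_logs` and the default of 50 lines are hypothetical choices, not part of Ray.

```python
import glob
import os


def tail_controller_logs(log_dir="/tmp/ray/session_latest/logs", n=50):
    """Return the last n lines of each controller_*.log file in log_dir."""
    pattern = os.path.join(log_dir, "controller_*.log")
    tails = {}
    for path in sorted(glob.glob(pattern)):
        # errors="replace" avoids crashing on stray non-UTF-8 bytes.
        with open(path, "r", errors="replace") as f:
            tails[path] = f.readlines()[-n:]
    return tails


if __name__ == "__main__":
    for path, lines in tail_controller_logs().items():
        print(f"==> {path} <==")
        print("".join(lines), end="")
```

Run it on the head node, where the Serve controller writes its logs, and paste the output into the thread.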