Hi all! I’m implementing a simple LLM server with Ray and vLLM that supports continuous batching. The code works nicely on a single GPU, but for larger models that need multiple GPUs, it never uses more than one, even though I’ve made it clear that two GPUs are available.
The full code and the command to run it are uploaded to this GitHub repo and are directly runnable.
Thanks in advance for any help!
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 2})  # each replica should get 2 GPUs
class VLLMPredictDeployment:
...
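Inside the deployment, the engine is constructed roughly as follows (a minimal sketch rather than my exact code: the model name is a placeholder, AsyncEngineArgs / AsyncLLMEngine are vLLM's async engine classes, and the complete, runnable version is in the repo). In this sketch I leave vLLM's tensor_parallel_size at its default, which matches my assumption that requesting two GPUs via ray_actor_options should be enough.

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

class VLLMPredictDeployment:  # decorated with @serve.deployment(...) as shown above
    def __init__(self):
        # Placeholder model name; the real model is chosen by the command in the repo.
        engine_args = AsyncEngineArgs(model="facebook/opt-13b")
        # vLLM's AsyncLLMEngine provides the continuous batching;
        # tensor_parallel_size is left at its default here.
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)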