Hi all! I’m implementing a simple LLM server with Ray and vLLM that supports continuous batching. The code works nicely on a single GPU, but for larger models that need multiple GPUs, it never uses more than one, even though I’ve made it clear that two GPUs are available.
The full code and the command to run it are uploaded to this GitHub repo and are directly runnable.
Thanks in advance for any help!
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 2})  # each replica should get 2 GPUs
class VLLMPredictDeployment:
...
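Inside the deployment, the engine is constructed roughly as follows (a minimal sketch rather than my exact code: the model name is a placeholder, AsyncEngineArgs / AsyncLLMEngine are vLLM's async engine classes, and the complete, runnable version is in the repo). In this sketch I leave vLLM's tensor_parallel_size at its default, which matches my assumption that requesting two GPUs via ray_actor_options should be enough.

from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

class VLLMPredictDeployment:  # decorated with @serve.deployment(...) as shown above
    def __init__(self):
        # Placeholder model name; the real model is chosen by the command in the repo.
        engine_args = AsyncEngineArgs(model="facebook/opt-13b")
        # vLLM's AsyncLLMEngine provides the continuous batching;
        # tensor_parallel_size is left at its default here.
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)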