I am using vLLM, which relies on Ray to initialize its cluster, and I am having trouble assigning specific GPUs to different models in my LLM inference pipeline. I have 4 RTX 4090 GPUs available, and I want to run a 42 GB model on 2 of the RTX 4090s (~48 GB combined) and a separate 22 GB model on 1 RTX 4090 (~24 GB).
However, I haven't found a straightforward way within Ray (or vLLM) to specify which GPUs each model should use. Here is what I tried:
from vllm import LLM
import os

# Attempt to place the first (42 GB) model on GPUs 0 and 1 with tensor parallelism.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

# Attempt to switch visibility to GPU 2 for the second (22 GB) model.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)
This loads the first LLM onto 2 GPUs, but when vLLM tries to reinitialize the Ray cluster for the second LLM instance, it gives the following error:
RuntimeError: torch.distributed is already initialized but the torch world size does not match parallel_config.world_size (2 vs. 1).
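One workaround I'm considering (not sure it is the intended approach) is to give each model its own process and set CUDA_VISIBLE_DEVICES before vLLM/Ray ever initialize in that process. A rough sketch, assuming llm_1_name and llm_2_name are the same model names as above:

import multiprocessing as mp

def run_model(gpu_ids, model_name, tp_size):
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_ids  # must be set before importing vllm in this process
    from vllm import LLM
    llm = LLM(model_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=tp_size)
    # ... run inference with llm here ...

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn so each child starts with a clean CUDA/Ray state
    p1 = ctx.Process(target=run_model, args=("0,1", llm_1_name, 2))
    p2 = ctx.Process(target=run_model, args=("2", llm_2_name, 1))
    p1.start(); p2.start()
    p1.join(); p2.join()

Is something like this the right direction, or is there a built-in way in Ray/vLLM to pin each model to specific GPUs within one process?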