Multi-GPU usage on multiple VMs | Ray cluster on multi-VM instances

I want to try an LLM, for example flan-ul2, on two A10 GPU VMs provided by AWS. Each VM has 4 GPUs, so my Ray cluster has 8 GPUs in total. I have already created the Ray cluster by running the following commands:

on head node:
ray start --head

on worker node:
ray start --address="<head-node-ip>:<port>"

Now, in my code I have a Python class that I want to deploy, and I want to use 6 GPUs for this task, shared between the worker and head nodes. How can I proceed?

Any leads would be appreciated.


@Shobhit_Agarwal Here is a good example of how you can use Ray Serve and Ray to serve an LLM model. For that model we use 16GB GPUs. We allocate one GPU per replica, so 6 replicas will use 6 GPUs.
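As a sketch of the pattern described above (the class name `FlanDeployment` and its body are placeholders, not the actual example code), each replica requests one GPU via `ray_actor_options`, so the total GPU demand is replicas times GPUs per replica:

```python
# Sketch: one GPU per replica, six replicas. With Ray installed, the
# deployment would be declared roughly as follows (placeholder names):
#
#   from ray import serve
#
#   @serve.deployment(num_replicas=6, ray_actor_options={"num_gpus": 1})
#   class FlanDeployment:
#       def __init__(self):
#           ...  # load the model onto this replica's GPU
#
# Each replica asks Ray for one GPU, so the cluster-wide GPU demand is:
options = {"num_replicas": 6, "ray_actor_options": {"num_gpus": 1}}
total_gpus = options["num_replicas"] * options["ray_actor_options"]["num_gpus"]
print(total_gpus)  # prints 6
```

Note that this gives you 6 independent copies of the model, one per GPU, rather than one model spread across 6 GPUs.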

@Jules_Damji, really appreciate the quick response. But if I set num_replicas=6 and num_gpus=1, that means I am making 6 copies of the model and each copy is utilising 1 GPU; please correct me if I am wrong.

The problem is that I can't use a single GPU for the LLM; I need at least 5 or 6 GPUs to serve the flan-ul2 model, since it is huge. So after creating the cluster, I set num_gpus=6 and num_replicas=1 in my deployment class, but I get an error saying that no resource can accommodate num_gpus=6. Any leads would be helpful.


@Jules_Damji, I have a scenario where I create a Ray cluster with 2 VMs, each having 4 GPUs. How can I make my Ray Serve deployment utilize 4 GPUs from the first instance and 1 GPU from the other? Is there a workaround for this?

I created a cluster with
ray start --head on the head node,
and ray start --address="<head-node-ip>:<port>" on the worker node,

and assigned num_gpus=5 in the @serve.deployment class, but I am still getting the error below:
no available node types can fulfill resource request {'GPU': 5.0},

even when I see resources available: {"GPU": 8.0}
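This error is consistent with how Ray schedules resources: num_gpus is a per-actor request, and a single actor must fit entirely on one node, so the cluster-wide total of 8 GPUs does not help. A minimal illustration of that fit check (illustrative only, not Ray's actual scheduler code):

```python
def fits_on_some_node(request_gpus: float, nodes_gpus: list) -> bool:
    """Illustration of per-node placement: a single actor's GPU request
    must fit entirely on one node; summing across nodes doesn't help."""
    return any(request_gpus <= capacity for capacity in nodes_gpus)

cluster = [4, 4]  # two VMs with 4 GPUs each, 8 GPUs in total
print(fits_on_some_node(5, cluster))  # prints False
print(fits_on_some_node(4, cluster))  # prints True
```

So {'GPU': 5.0} fails on two 4-GPU nodes even though {"GPU": 8.0} is available cluster-wide.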

I hope there is a workaround for this.
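One possible direction is Ray's placement-group API, which reserves resource bundles where each bundle fits on a single node. This is a sketch under that assumption (the Serve-side wiring, i.e. attaching the deployment's actors to the group, is omitted); the real API call is shown in the comments:

```python
# Sketch (assumption: Ray 2.x core placement-group API): reserve 4 GPUs
# on one node and 1 GPU on another, then run the model's actors inside
# those bundles.
#
#   import ray
#   from ray.util.placement_group import placement_group
#
#   ray.init(address="auto")
#   pg = placement_group(bundles=[{"GPU": 4}, {"GPU": 1}], strategy="PACK")
#   ray.get(pg.ready())  # blocks until both bundles are reserved
#
# Each bundle must fit on one node, which is why [{"GPU": 4}, {"GPU": 1}]
# is schedulable on two 4-GPU nodes while a single {"GPU": 5} bundle is not.
bundles = [{"GPU": 4}, {"GPU": 1}]
node_capacity = 4
assert all(b["GPU"] <= node_capacity for b in bundles)
print(sum(b["GPU"] for b in bundles))  # prints 5
```

Even with the reservation in place, note that the model code itself must be written to shard or parallelize across the 4+1 GPUs; reserving the GPUs alone does not split a single model across nodes.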