Ray Serve GPU allocation error: deployment consumes all 8 GPUs even though num_gpus=4 is set

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I use Ray Serve to deploy a vLLM server on a DGX H100 machine with 8 GPUs.
It works fine with num_replicas=1 but fails with num_replicas=2: the error says no GPU is available.

The following resource request cannot be scheduled right now: {'CPU': 2.0, 'GPU': 4.0}

I used ray.available_resources() to check the available resources:

  • before deployment: ray.available_resources() shows {GPU: 8}
  • after deploying with num_replicas=1: ray.available_resources() shows no GPU at all

So a single replica configured with num_gpus=4 appears to consume all 8 GPUs, which is why the second replica cannot be scheduled.
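For completeness, this is roughly how I ran the check (a minimal sketch; the ray.init(address="auto") call assumes the Ray cluster backing Serve is already running):

import ray

ray.init(address="auto")          # attach to the running Ray cluster
print(ray.available_resources())  # inspect the remaining 'GPU' count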

Below is my Ray Serve deployment code:

serve.run(
    VLLMSever.options(
        num_replicas=2,
        ray_actor_options={'num_cpus': 2, 'num_gpus': 4.0},
    ).bind(
        model="/models/Llama-2-70b-chat-hf",
        tensor_parallel_size=4,
    )
)

nvidia-smi shows no processes using the GPUs before the Ray Serve deployment:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80G...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0    76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80G...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   27C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80G...  On   | 00000000:52:00.0 Off |                    0 |
| N/A   31C    P0    76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   29C    P0    74W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80G...  On   | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80G...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   27C    P0   112W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80G...  On   | 00000000:D1:00.0 Off |                    0 |
| N/A   33C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80G...  On   | 00000000:DF:00.0 Off |                    0 |
| N/A   34C    P0    81W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Figured it out: by default, vLLM internally uses Ray to allocate resources for its tensor-parallel workers, so the GPUs were being reserved twice, once by the Serve replica (num_gpus=4) and again by vLLM's own Ray workers.
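For anyone hitting the same thing, here is a minimal sketch of one way to avoid the double reservation, assuming VLLMSever wraps vLLM's engine and vLLM keeps using its Ray backend for tensor_parallel_size=4: stop reserving GPUs for the Serve replica actor itself and let vLLM's worker actors claim them.

serve.run(
    VLLMSever.options(
        num_replicas=2,
        # The replica actor itself only needs CPUs; vLLM's own Ray workers
        # request the GPUs (4 per replica for tensor_parallel_size=4).
        ray_actor_options={'num_cpus': 2, 'num_gpus': 0},
    ).bind(
        model="/models/Llama-2-70b-chat-hf",
        tensor_parallel_size=4,
    )
)

The general point is that either the Serve replica or vLLM's internal workers should own the GPU reservation, not both.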