How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I use Ray Serve to deploy a vLLM server on a DGX H100 machine with 8 GPUs.
It works fine with num_replicas=1, but fails with num_replicas=2. The error reports that no GPU is available:
The following resource request cannot be scheduled right now: {'CPU': 2.0, 'GPU': 4.0}
I used ray.available_resources() to check available resources:
- before deployment, ray.available_resources() shows {GPU:8}
- after deployment using num_replicas=1, ray.available_resources() shows no GPU
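To make the mismatch explicit, here is a small arithmetic sketch in plain Python (no Ray calls). It takes the per-replica request from the error message and the GPU counts reported above. The helper name `replicas_that_fit` and the CPU counts are hypothetical, chosen only for illustration, since the post does not give them.

```python
def replicas_that_fit(available: dict, per_replica: dict) -> int:
    """Count how many replicas fit, limited by the scarcest requested resource."""
    return min(
        int(available.get(name, 0.0) // amount)
        for name, amount in per_replica.items()
        if amount > 0
    )

# Per-replica request, as printed in the scheduling error.
request = {"CPU": 2.0, "GPU": 4.0}

# GPU counts are from the post; the CPU counts are assumed placeholders.
before = {"CPU": 64.0, "GPU": 8.0}
after_one_replica = {"CPU": 62.0}  # no GPU entry, as observed

print(replicas_that_fit(before, request))             # 2
print(replicas_that_fit(after_one_replica, request))  # 0

# GPUs that disappeared after deploying a single replica.
gpus_taken = before["GPU"] - after_one_replica.get("GPU", 0.0)
print(gpus_taken)  # 8.0
```

By the declared request alone, two replicas should fit on 8 GPUs. However, one replica leaves none available, so 8 GPUs are being reserved rather than the 4 declared in ray_actor_options.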
Below is my ray serve deployment configuration:
serve.run(
    VLLMSever.options(
        num_replicas=2,
        ray_actor_options={'num_cpus': 2, 'num_gpus': 4.0}
    ).bind(
        model="/models/Llama-2-70b-chat-hf",
        tensor_parallel_size=4
    )
)
nvidia-smi shows no processes using the GPUs before the Ray Serve deployment:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 80G...  On   | 00000000:1B:00.0 Off |                    0 |
| N/A   25C    P0    76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 80G...  On   | 00000000:43:00.0 Off |                    0 |
| N/A   27C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 80G...  On   | 00000000:52:00.0 Off |                    0 |
| N/A   31C    P0    76W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 80G...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   29C    P0    74W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA H100 80G...  On   | 00000000:9D:00.0 Off |                    0 |
| N/A   29C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA H100 80G...  On   | 00000000:C3:00.0 Off |                    0 |
| N/A   27C    P0   112W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA H100 80G...  On   | 00000000:D1:00.0 Off |                    0 |
| N/A   33C    P0    75W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA H100 80G...  On   | 00000000:DF:00.0 Off |                    0 |
| N/A   34C    P0    81W / 700W |      0MiB / 81559MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+