Spread across several fractional GPUs, or 1 < num_gpus < 2


I’m wondering if there’s any way to have one Ray actor use 2 GPUs and still place a second Ray actor on the second GPU. For example, the first actor would have num_gpus=1.5 and the second num_gpus=0.5. If I do this I get ValueError: Resource quantities >1 must be whole numbers.

More context:
I want to run inference on two different models and I have 2 GPUs. The first model needs to be distributed across both GPUs because it doesn’t fit into the memory of one. The second model fits on 1 GPU, and I would like to replicate it onto the other GPU so I can execute two inference requests to that model in parallel. Is there any way to do this?

Unfortunately, the use case you describe is not directly supported. The current Ray scheduling APIs (ray.get_runtime_context().get_accelerator_ids()) do not let an actor determine which fraction of which GPU it was allocated. The current API for requesting resources is also ambiguous about whether a 1.5 GPU request should be split evenly or bin-packed across GPUs. It’s an interesting use case, though, so if you are open to it, please feel free to open a feature request on GitHub and/or contribute the feature.

For now, the best approach is to use custom resources and set CUDA_VISIBLE_DEVICES yourself. For example, you could define a resource like {"gpu_slice": 3} on each node, then request {"gpu_slice": 2} for the 1.5 GPU task and {"gpu_slice": 1} for the 0.5 GPU task. Ray only counts these units for scheduling; mapping them to physical GPUs is up to your own code.
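
A rough sketch of that workaround, in case it helps. The names SLICE_PLAN, pin_gpus, ModelActor and launch are made up for illustration, and the GPU indices assume a single node with two GPUs numbered 0 and 1. Nothing here enforces how much GPU memory each actor actually uses; the slices are just counters that keep Ray from oversubscribing the node.

```python
import os

# Hypothetical bookkeeping for a 2-GPU node: Ray only counts "gpu_slice"
# units, it does not know which physical GPU a slice refers to.
SLICE_PLAN = {
    "big_model": {"slices": 2, "cuda_devices": "0,1"},  # spans both GPUs
    "small_model": {"slices": 1, "cuda_devices": "1"},  # shares GPU 1
}
TOTAL_SLICES = 3


def pin_gpus(role):
    """Set CUDA_VISIBLE_DEVICES for this process.

    Must be called before any CUDA library is initialized in the process.
    """
    devices = SLICE_PLAN[role]["cuda_devices"]
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    return devices


def launch():
    """Start one actor per role and report the devices each one pinned."""
    import ray  # imported lazily so the helpers above work without Ray

    # Local single-node setup; on a cluster the custom resource would be
    # declared when each node is started instead.
    ray.init(resources={"gpu_slice": TOTAL_SLICES})

    @ray.remote
    class ModelActor:
        def __init__(self, role):
            self.devices = pin_gpus(role)
            # Load the model here, after the environment variable is set.

        def visible_devices(self):
            return self.devices

    actors = {
        role: ModelActor.options(
            resources={"gpu_slice": plan["slices"]}
        ).remote(role)
        for role, plan in SLICE_PLAN.items()
    }
    return {
        role: ray.get(actor.visible_devices.remote())
        for role, actor in actors.items()
    }
```

Calling launch() on a machine with Ray installed should start both actors, since 2 + 1 slices fit within the 3 declared. Note that the actors do not request num_gpus at all, so Ray's own GPU accounting is bypassed and the device assignment relies entirely on the mapping above.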