About KubeRay GPU multi-tenancy

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hi, I’m using the KubeRay operator on my k8s cluster and ran into a GPU multi-tenancy issue. One k8s node has 8 GPUs while the other nodes only have CPUs. The problem is that not all processes on the GPU node are managed by k8s: another model that does not use Ray already occupies 2 GPUs on that node (at ~90% GPU memory usage). When I start a pod that requests 2 GPUs, KubeRay appears to be unaware of this external GPU usage and may, by chance, hand the already occupied GPUs to the pod, after which the model in the pod raises a CUDA out-of-memory error.

Since the other nodes in the k8s cluster only have CPUs, the only workaround I can think of (though it may not work) is to use a node selector to force the pod onto the GPU node while NOT specifying any GPU resource for it. Then, according to Using GPUs — Ray 2.11.0, that pod can see all GPUs on the node, and I can manually assign free GPUs to the model, roughly as sketched below.
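Here is a minimal sketch of what I mean by manually assigning GPUs, assuming the pod lands on the GPU node via a nodeSelector with no nvidia.com/gpu request and can therefore see all 8 devices. The GPU indices and the use of PyTorch are only assumptions for illustration:

```python
import os

# Rough sketch of the workaround described above (not verified to work).
# Assumption: the pod was placed on the GPU node via a nodeSelector and has no
# nvidia.com/gpu request/limit, so all 8 GPUs are visible inside it.
# Assumption: GPUs 0 and 1 are the ones already occupied by the non-Ray model,
# so this process pins itself to two free devices before any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

import torch  # any CUDA framework would do; torch is only an example

# The two free physical GPUs are now re-indexed as cuda:0 and cuda:1.
print(torch.cuda.device_count())  # expected to print 2
```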

Do you have any ideas on how to deal with this issue? I wonder whether a built-in resource type like “GPU-memory” could help here. Thank you!
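For what it’s worth, something close to this can be approximated today with Ray custom resources, though Ray would still have no visibility into GPU memory consumed by processes it doesn’t manage. A minimal sketch, with the resource name, capacity, and request sizes all made up:

```python
import ray

# Sketch of treating GPU memory as a Ray custom resource. The name
# "gpu_memory_gb" and all the numbers are made-up examples; on KubeRay the
# capacity would be declared through the worker group's rayStartParams
# "resources" field instead of ray.init().
ray.init(resources={"gpu_memory_gb": 480})

@ray.remote(resources={"gpu_memory_gb": 40})
def run_model():
    # Ray only uses this value for scheduling bookkeeping; it does not measure
    # or enforce actual GPU memory usage, and it cannot see non-Ray processes.
    return "ok"

print(ray.get(run_model.remote()))
```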