Hello, we are having an issue running XGBoost on Ray on GCE GPU machines. It appears that XGBoost cannot find any GPU, although we can see the GPUs in nvidia-smi. We tried different GPU types and got the same result.
One thing we noticed is that in the Ray job printout in the notebook, the “System Info” reports more GPUs than there actually are. For example, when we have three workers, each with a single GPU, it reports eight GPUs instead. We are wondering if this is related to the error.
We found that this always happens when the underlying physical machine has more GPUs than the container. Ray is trying to use GPUs that are not assigned to the container. Is there a bug in Ray’s GPU discovery code that makes it read the physical machine instead of the container? We ran nvidia-smi in the container and confirmed that the container can only see one GPU, although the physical machine has four.
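In case it helps anyone reproduce this, the nvidia-smi check can be scripted roughly as below. This is only a sketch: the function names are ours, and it assumes that `nvidia-smi -L` prints one line per GPU starting with "GPU ".

```python
import shutil
import subprocess


def parse_gpu_list(output: str) -> int:
    """Count GPU entries in `nvidia-smi -L` style output.

    Assumes one line per GPU, each starting with "GPU ". Indented
    sub-entries (for example MIG devices) are not counted.
    """
    return sum(1 for line in output.splitlines() if line.startswith("GPU "))


def visible_gpu_count() -> int:
    """Return the number of GPUs nvidia-smi reports in this environment.

    Returns 0 when nvidia-smi is missing or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    return parse_gpu_list(result.stdout)
```

Running `visible_gpu_count()` inside the container and comparing it with the GPU count Ray reports (for example via `ray.cluster_resources().get("GPU")`) should show the mismatch directly.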
OK, we have found the root cause: the GPU autodetection tries to use GPUtil.getGPUs() if GPUtil is available. This would return the correct number of GPUs available to the container because it ultimately just shells out to nvidia-smi. However, when the GPUtil package is not installed, Ray instead falls back to returning the number of entries in /proc/driver/nvidia/gpus, which is the number of GPUs attached to the host VM and not the number of GPUs provisioned to the container.
This sounds like a bug. Can we file a bug report on this?
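To make the report concrete, here is a minimal sketch of the detection order as we understand it. It is not Ray's actual code; the function names are hypothetical, and only GPUtil.getGPUs() and the /proc/driver/nvidia/gpus path come from what we observed.

```python
import importlib
import os


def count_gpus_from_proc(proc_gpu_dir: str = "/proc/driver/nvidia/gpus") -> int:
    """Fallback path: count the entries under the NVIDIA proc directory.

    In our setup this directory reflects the GPUs attached to the host VM,
    not the GPUs provisioned to the container.
    """
    if not os.path.isdir(proc_gpu_dir):
        return 0
    return len(os.listdir(proc_gpu_dir))


def autodetect_gpu_count(proc_gpu_dir: str = "/proc/driver/nvidia/gpus") -> int:
    """Hypothetical reconstruction of the two-step detection described above."""
    try:
        gputil = importlib.import_module("GPUtil")
    except ImportError:
        gputil = None

    if gputil is not None:
        # GPUtil shells out to nvidia-smi, so the count matches the container.
        return len(gputil.getGPUs())

    # GPUtil is not installed: fall back to the proc directory (host view).
    return count_gpus_from_proc(proc_gpu_dir)
```

If this reading is right, installing GPUtil in the container image, or passing the GPU count explicitly (for example `ray start --num-gpus=1` or `ray.init(num_gpus=1)`), should sidestep the fallback until the detection is fixed.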