Issues with GPU usage when Ray Data is used in Docker

We are using Ray Datasets to transform a set of PDF files: we extract information from them and perform NLP tasks such as classification, named entity recognition, and embedding generation. It’s a pipeline that transforms the dataset, adding columns to each row as it passes through. At least three steps in the pipeline use the GPU for inference. The pipeline is deployed as a Ray Serve service with a FastAPI ingress. The pipeline itself runs asynchronously in a separate actor, apart from the main ingress process. The FastAPI deployment is used to kick off the pipeline on a set of files in a GCS bucket.
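
For reference, the GPU stages are wired up with Ray Data roughly like the sketch below (class name, bucket path, and pool sizes are simplified placeholders rather than our exact code):

import ray
import torch

class GPUStage:
    # stand-in for one of the inference stages (classification, NER, embeddings)
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        # the actual model would be loaded onto self.device here

    def __call__(self, batch):
        # run inference and append a new column to the batch
        batch["prediction"] = ["<output>"] * len(batch)
        return batch

ds = ray.data.read_binary_files("gs://<our-bucket>/pdfs/")
ds = ds.map_batches(
    GPUStage,
    batch_format="pandas",
    compute=ray.data.ActorPoolStrategy(1, 2),  # small pool of inference actors
    num_gpus=1,                                # each actor should reserve one GPU
)
ds.take(1)  # trigger execution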

What we are seeing is that when we run the whole deployment directly on a VM, it uses the GPU just fine. But as soon as we dockerize it and run it in a container, the GPU is not used properly, even though the GPU is available to PyTorch inside the container and we have passed --num-gpus=1 to ray start. We see some actors on the GPU with nvidia-smi, but utilization stays at 0% for the entire run of the pipeline. This is blocking us from deploying this in Kubernetes or in Docker containers more generally.
Another thing to note: if we don’t use Ray Datasets and instead run similar GPU code directly in actors, it works fine inside containers.
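
For example, a plain actor along these lines uses the GPU as expected inside the same container (a minimal sketch, not our real workload):

import ray
import torch

@ray.remote(num_gpus=1)
class PlainGPUWorker:
    def infer(self):
        # the actor is scheduled with a GPU and PyTorch can see it
        return torch.cuda.is_available()

ray.init(address="auto")  # connect to the cluster started by `ray start`
worker = PlainGPUWorker.remote()
print(ray.get(worker.infer.remote()))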

We are using Ray version 2.4.0.

Hi @umar-dreamai,

Does this behavior also come up when you operate directly in the container?

E.g., can you open an interactive Python shell in the container and run something like

import ray
ray.init()
print(ray.cluster_resources())

and try to schedule a task or actor with GPUs?
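
For example, something along these lines should tell you whether a GPU actually gets assigned (the task body is just a minimal check):

import ray
import torch

@ray.remote(num_gpus=1)
def gpu_check():
    # should return a non-empty list of GPU ids and True if the GPU is
    # actually assigned to the worker and visible to PyTorch
    return ray.get_gpu_ids(), torch.cuda.is_available()

print(ray.get(gpu_check.remote()))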

How are you calling the dataset operation?

My first intuition here is that the cluster connection somehow fails in the container. When you use Ray Datasets, it calls ray.init() automatically if it’s not already connected. Maybe it can’t find the cluster (e.g. because the RAY_ADDRESS variable is unset) and starts a new, local cluster instead.
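
If that is what’s happening, explicitly connecting to the running cluster before building the dataset should surface it right away, e.g.:

import ray

# connect to the cluster started by `ray start` inside the container,
# instead of letting Ray Data implicitly start a fresh local one
ray.init(address="auto")
print(ray.cluster_resources())  # the GPU given to `ray start` should show up here

Alternatively, setting RAY_ADDRESS=auto in the container environment should have the same effect.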