We are using Ray Datasets to transform a set of PDF files, extract information from them, and perform some NLP tasks like classification, Named Entity Recognition, embedding generation, etc. It’s a pipeline that transforms the dataset, adding columns to it as each row passes through. At least three steps in the pipeline use the GPU for inference. The pipeline is deployed as a Ray Serve based service with a FastAPI ingress. The pipeline itself runs asynchronously in a separate actor, apart from the main ingress process. The FastAPI deployment is used to kick off the pipeline on a bunch of files in a GCS bucket.
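To make the setup concrete, here is a minimal sketch of the rough shape of the deployment. Names like PdfPipelineRunner, EmbeddingModel, and extract_text are placeholders, not our real code:

```python
import ray
from ray import serve
from ray.data import ActorPoolStrategy
from fastapi import FastAPI


def extract_text(batch):
    # CPU step: adds a column to each batch (placeholder extraction logic)
    batch["text"] = [b[:100] for b in batch["bytes"]]
    return batch


class EmbeddingModel:
    """Callable class so the model is loaded once per Datasets actor, on GPU."""

    def __init__(self):
        import torch
        assert torch.cuda.is_available()  # GPU step; there are ~3 such steps in the real pipeline
        # self.model = load_model().to("cuda")

    def __call__(self, batch):
        # Run inference and add a new column (placeholder output)
        batch["embedding"] = [[0.0]] * len(batch["text"])
        return batch


@ray.remote
class PdfPipelineRunner:
    """Separate actor so the pipeline runs apart from the Serve ingress process."""

    def run(self, gcs_uri: str):
        ds = ray.data.read_binary_files(gcs_uri)  # PDFs in a GCS bucket
        ds = ds.map_batches(extract_text, batch_format="pandas")
        ds = ds.map_batches(
            EmbeddingModel,
            compute=ActorPoolStrategy(1, 2),  # pool of GPU actors
            num_gpus=1,                       # each actor gets one GPU
            batch_size=16,
            batch_format="pandas",
        )
        ds.write_parquet("/tmp/output")


app = FastAPI()


@serve.deployment
@serve.ingress(app)
class Ingress:
    def __init__(self):
        self.runner = PdfPipelineRunner.remote()

    @app.post("/run")
    async def run(self, gcs_uri: str):
        # Kick off the pipeline asynchronously; don't block the ingress.
        self.runner.run.remote(gcs_uri)
        return {"status": "started"}
```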
What we are seeing is that when we run the whole deployment directly on a VM, it uses the GPU just fine. But as soon as we dockerize it and run it in a container, the GPU is not being used properly, even though the GPU is available to PyTorch inside the container and we have given --num-gpus=1 to ray start. We see some actors on the GPU with the nvidia-smi command, but utilization remains at 0% throughout the run of the pipeline. This is preventing us from putting this deployment into Kubernetes or otherwise into Docker containers.
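This is roughly how we confirm that both Ray and PyTorch see the GPU inside the container (an illustrative check, not our exact code); all of it looks correct, yet utilization stays at 0% while the Datasets pipeline runs:

```python
# Ray is started inside the container with: ray start --head --num-gpus=1
import os
import ray
import torch

ray.init(address="auto")
print(ray.cluster_resources())    # shows 'GPU': 1.0, so Ray sees the GPU
print(torch.cuda.is_available())  # True inside the container as well


@ray.remote(num_gpus=1)
def gpu_check():
    # Inside a GPU task, Ray sets CUDA_VISIBLE_DEVICES for us.
    return {
        "gpu_ids": ray.get_gpu_ids(),
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "torch_sees_cuda": torch.cuda.is_available(),
    }


print(ray.get(gpu_check.remote()))
```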
Another thing to note is that if we don’t use Ray Datasets and instead run similar code directly in GPU actors, it works fine within containers.
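For contrast, the plain-actor version that does behave correctly in the same container looks roughly like this (GpuWorker and infer_batch are placeholder names):

```python
import ray
import torch


@ray.remote(num_gpus=1)
class GpuWorker:
    def __init__(self):
        self.device = "cuda"
        # self.model = load_model().to(self.device)

    def infer_batch(self, batch):
        # Run this way inside Docker, nvidia-smi shows real GPU utilization.
        x = torch.randn(len(batch), 768, device=self.device)  # placeholder compute
        return x.sum().item()


worker = GpuWorker.remote()
print(ray.get(worker.infer_batch.remote(["doc1", "doc2"])))
```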
We are using Ray version 2.4.0