Ray Serve container runtime_env cannot use GPU

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to run a Ray Serve deployment with Docker containers, but I am not able to get the replicas to use the GPU. I am running on a local setup with 1 GPU.

docker-compose file for the Ray head:

    ray_head:
        image: ray_head
        container_name: ray_head
        build:
            context: ./ray_pipeline/ray_head
            dockerfile: Dockerfile
        command: ray start --head --dashboard-host=0.0.0.0 --block
        ports:
            - 8265:8265     # ray dashboard
            - 8000:8000     # ray serve
            - 10001:10001   # ray client
        restart: always
        privileged: true
        volumes:
            - /dev/shm:/dev/shm
            - /var/lib/containers:/var/lib/containers
        environment:
            - NVIDIA_VISIBLE_DEVICES=all
        networks:
            - ray_network
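
As a sanity check (a hypothetical snippet, run from the host through the client port published above), printing the cluster resources shows whether Ray registered the GPU on the head node:

    import ray

    ray.init(address='ray://localhost:10001')
    # Expect an entry like 'GPU': 1.0 if the head node detected the device
    print(ray.cluster_resources())
    ray.shutdown()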

Snippet for the Ray Serve deployment:

    import ray
    from ray import serve

    # Run each replica in its own container image via the container runtime_env
    runtime_env = {
        "container": {
            "image": "detector:latest",
            "run_options": ["--gpus all", "-v /dev/shm:/dev/shm", "--privileged", "--log-level=debug"]
        }
    }

    @serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1, "num_gpus": 0.25, "runtime_env": runtime_env})
    class RayDetector:
        def __init__(self):
            import os
            import torch
            # Log what the replica sees: the GPU IDs Ray assigned, the CUDA
            # devices exposed to the process, and whether torch can use them
            print(f'# ray.get_gpu_ids(): {ray.get_gpu_ids()}')
            print(f'# os.environ["CUDA_VISIBLE_DEVICES"]: {os.environ["CUDA_VISIBLE_DEVICES"]}')
            print(f'# torch.cuda.is_available(): {torch.cuda.is_available()}')

            # initialize object detector
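
Note that num_gpus=0.25 is a fractional request: Ray reserves a quarter of the GPU for the replica (up to four such replicas could share the single device) and sets CUDA_VISIBLE_DEVICES accordingly. Inside the constructor, binding the model to the assigned device would look roughly like this (a hypothetical sketch, assuming a torch model):

    import torch

    # ray.get_gpu_ids() maps to the devices exposed via CUDA_VISIBLE_DEVICES
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # model.to(device)  # hypothetical: move the detector weights onto the GPU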

Snippet from main code:

    import ray
    from ray import serve

    # Connect to the cluster via the Ray client port exposed by the compose file
    ray.init(address='ray://localhost:10001')
    serve.start(detached=True, http_options={"host": "0.0.0.0"})

    RayDetector.deploy()
    ray_detector_handle = serve.get_deployment('RayDetector').get_handle()
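
For reference, a call through the handle returns an ObjectRef that resolves with ray.get (a sketch; input_image is a placeholder, and it assumes RayDetector implements __call__ to run detection):

    # The call is routed to one of the replicas; ray.get() blocks on the result
    detections = ray.get(ray_detector_handle.remote(input_image))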

When I run the above code, I get the following output:

    ray.get_gpu_ids(): [0]
    os.environ["CUDA_VISIBLE_DEVICES"]: 0
    torch.cuda.is_available(): False

What could be the issue here? It seems that the GPU device is properly detected, but PyTorch is not able to use it.

Is there any chance this is an issue with the Docker container itself? What do you see if you run the container and manually check torch.cuda.is_available() in the Python interpreter?
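
For example, something along these lines run inside the container is a quick way to tell whether the image ships a CPU-only PyTorch build:

    import torch

    print(torch.__version__)          # a "+cpu" suffix indicates a CPU-only wheel
    print(torch.version.cuda)         # None means torch was built without CUDA support
    print(torch.cuda.is_available())
    print(torch.cuda.device_count())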

I have solved this by installing nvidia-docker2 inside the Ray head container.

Now I am running into another problem. On the dashboard ‘Cluster’ page, I am not able to see the status (e.g. pending tasks, object refs in scope) of the replicas that are spun up using podman.

Could you please tell me: did you manage to pull the service image and run the service in a container this way?