How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to run a Ray Serve deployment with Docker containers, but I cannot get the replicas to use the GPU. I am running on a local setup with one GPU.

docker-compose file for the Ray head:
```yaml
ray_head:
  image: ray_head
  container_name: ray_head
  build:
    context: ./ray_pipeline/ray_head
    dockerfile: Dockerfile
  command: ray start --head --dashboard-host=0.0.0.0 --block
  ports:
    - 8265:8265   # Ray dashboard
    - 8000:8000   # Ray Serve
    - 10001:10001 # Ray client
  restart: always
  privileged: true
  volumes:
    - /dev/shm:/dev/shm
    - /var/lib/containers:/var/lib/containers
  environment:
    - NVIDIA_VISIBLE_DEVICES=all
  networks:
    - ray_network
```
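As I understand it from the Compose documentation, setting `NVIDIA_VISIBLE_DEVICES` alone may not be enough; the documented way to request a GPU for a service is a device reservation (assuming the NVIDIA Container Toolkit is installed on the host — I have not yet confirmed whether this changes anything in my setup):

```yaml
ray_head:
  # ...same service definition as above...
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```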
Snippet for the Ray Serve deployment:

```python
import ray
from ray import serve

runtime_env = {
    "container": {
        "image": "detector:latest",
        "run_options": ["--gpus all", "-v /dev/shm:/dev/shm", "--privileged", "--log-level=debug"]
    }
}

@serve.deployment(num_replicas=1, ray_actor_options={"num_cpus": 1, "num_gpus": 0.25, "runtime_env": runtime_env})
class RayDetector:
    def __init__(self):
        import os
        import torch

        print(f'# ray.get_gpu_ids(): {ray.get_gpu_ids()}')
        print(f'# os.environ["CUDA_VISIBLE_DEVICES"]: {os.environ["CUDA_VISIBLE_DEVICES"]}')
        print(f'# torch.cuda.is_available(): {torch.cuda.is_available()}')
        # initialize object detector
```
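For what it's worth, my understanding is that Ray communicates the GPU assignment to a replica via the `CUDA_VISIBLE_DEVICES` environment variable, which is why I print it above. A small standalone sketch of how a process would interpret that variable (`visible_cuda_devices` is a hypothetical helper for illustration, not Ray code):

```python
import os

def visible_cuda_devices(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device index strings.

    An unset variable means no restriction (all devices visible, returns
    None); an empty string means no devices are visible at all.
    """
    if env is None:
        env = os.environ
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None          # no restriction: all GPUs visible
    value = value.strip()
    if not value:
        return []            # explicitly hidden: no GPUs visible
    return [d.strip() for d in value.split(",")]

print(visible_cuda_devices({"CUDA_VISIBLE_DEVICES": "0"}))  # ['0']
print(visible_cuda_devices({"CUDA_VISIBLE_DEVICES": ""}))   # []
```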
Snippet from the main code:

```python
import ray
from ray import serve

ray.init(address='ray://localhost:10001')
serve.start(detached=True, http_options={"host": "0.0.0.0"})
RayDetector.deploy()
ray_detector_handle = serve.get_deployment('RayDetector').get_handle()
```
When I run the above code, I get the following output:

```
ray.get_gpu_ids(): [0]
os.environ["CUDA_VISIBLE_DEVICES"]: 0
torch.cuda.is_available(): False
```
What could be the issue here? The GPU seems to be detected correctly (Ray assigns device 0 to the replica), but PyTorch is not able to use it.
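To isolate whether the detector image can reach the GPU at all outside of Ray, I plan to run it directly (this assumes the NVIDIA Container Toolkit is installed on the host and that `torch` is installed in the image):

```
# Test PyTorch's view of CUDA inside the image, bypassing Ray entirely.
docker run --rm --gpus all detector:latest \
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"

# Verify the driver is visible inside the container at all.
docker run --rm --gpus all detector:latest nvidia-smi
```

If this prints `True` but the replica still reports `False`, the problem would seem to be in how Ray launches the nested container rather than in the image itself.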