RuntimeError: CUDA error: invalid device ordinal when running the CIFAR example in PyTorch

Hi @kai, I see that you never got a response about this, and I am currently having the same issue. My output for that code was this:

2024-09-11 15:12:49,508 INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
Shared connection to 34.83.226.52 closed.
2024-09-11 15:12:59,617 INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
2024-09-11 15:13:06,505 INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 10.138.0.44:6379...
2024-09-11 15:13:06,513 INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
{'CPU': 2.0,
 'GPU': 2.0,
 'accelerator_type:T4': 1.0,
 'memory': 4382748672.0,
 'node:10.138.0.44': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 2191374336.0}
Traceback (most recent call last):
  File "/home/ray/CudaTest.py", line 6, in <module>
    pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])
  File "/home/ray/anaconda3/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'CUDA_VISIBLE_DEVICES'
Shared connection to 34.83.226.52 closed.
Error: Command failed:

  ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1cecef3852/32eb62159c/%C -o ControlPersist=10s -o ConnectTimeout=120s ret_raiinmaker_com@34.83.226.52 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_nvidia_docker /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/CudaTest.py)'"'"'"'"'"'"'"'"''"'"' )'
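
For reference, CudaTest.py is roughly the following (a minimal sketch reconstructed from the output and traceback above, not the exact file): it connects to the existing cluster, pretty-prints the cluster resources, and then reads CUDA_VISIBLE_DEVICES directly, which is the line that raises the KeyError.

```python
import os
import pprint

import ray

# Connect to the existing cluster started by the autoscaler.
ray.init(address="auto")

# Produces the resources dict shown in the output above.
pprint.pprint(ray.cluster_resources())

# This is what raises the KeyError: as far as I can tell, Ray only sets
# CUDA_VISIBLE_DEVICES for worker processes running tasks/actors that
# requested GPUs, not for the driver process itself.
pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])
```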

I am using Python 3.9.19. Here is some output from my head node using the CLI:

(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ ray --version
ray, version 2.8.1
(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ docker exec -it ray_nvidia_docker ray --version
2024-09-11 15:24:44,215 - INFO - NumExpr defaulting to 2 threads.
ray, version 2.30.0
(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ docker exec -it ray_nvidia_docker python -c "import torch; print(torch.__version__)"
2.0.1+cu118
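
If it helps with diagnosis, here is a minimal sketch of the same check done from inside a task that actually requests a GPU, where Ray should set CUDA_VISIBLE_DEVICES itself (show_gpu_env is just an illustrative name):

```python
import os

import ray


@ray.remote(num_gpus=1)
def show_gpu_env():
    # Inside a task scheduled with num_gpus=1, Ray assigns specific GPU ids
    # and exports them to the worker via CUDA_VISIBLE_DEVICES.
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "ray.get_gpu_ids()": ray.get_gpu_ids(),
    }


ray.init(address="auto")
print(ray.get(show_gpu_env.remote()))
```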

I just opened a discussion of my own about a similar issue; here is the link: Cuda Error: invalid device ordinal during training on GCP cluster