RuntimeError: CUDA error: invalid device ordinal when running the CIFAR example in PyTorch

Hi @kai, I see that you never got a response about this, and I am currently having the same issue. My output for that code was this:

2024-09-11 15:12:49,508 INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
Shared connection to 34.83.226.52 closed.
2024-09-11 15:12:59,617 INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
2024-09-11 15:13:06,505 INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 10.138.0.44:6379...
2024-09-11 15:13:06,513 INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
{'CPU': 2.0,
 'GPU': 2.0,
 'accelerator_type:T4': 1.0,
 'memory': 4382748672.0,
 'node:10.138.0.44': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 2191374336.0}
Traceback (most recent call last):
  File "/home/ray/CudaTest.py", line 6, in <module>
    pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])
  File "/home/ray/anaconda3/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'CUDA_VISIBLE_DEVICES'
Shared connection to 34.83.226.52 closed.
Error: Command failed:

  ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1cecef3852/32eb62159c/%C -o ControlPersist=10s -o ConnectTimeout=120s ret_raiinmaker_com@34.83.226.52 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_nvidia_docker /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/CudaTest.py)'"'"'"'"'"'"'"'"''"'"' )'
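
For reference, CudaTest.py is roughly the following (a minimal sketch reconstructed from the output and traceback above, not the exact file): it connects to the existing cluster, pretty-prints the cluster resources, and then reads CUDA_VISIBLE_DEVICES directly, which is the line that raises the KeyError.

```python
import os
import pprint

import ray

# Connect to the existing cluster started by the autoscaler.
ray.init(address="auto")

# Produces the resources dict shown in the output above.
pprint.pprint(ray.cluster_resources())

# This is what raises the KeyError: as far as I can tell, Ray only sets
# CUDA_VISIBLE_DEVICES for worker processes running tasks/actors that
# requested GPUs, not for the driver process itself.
pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])
```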

I am using Python 3.9.19. Here is some output from my head node using the CLI:

(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ ray --version
ray, version 2.8.1
(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ docker exec -it ray_nvidia_docker ray --version
2024-09-11 15:24:44,215 - INFO - NumExpr defaulting to 2 threads.
ray, version 2.30.0
(base) ret_raiinmaker_com@ray-gpu-docker-head-79834ba0-compute:/$ docker exec -it ray_nvidia_docker python -c "import torch; print(torch.__version__)"
2.0.1+cu118
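
If it helps with diagnosis, here is a minimal sketch of the same check done from inside a task that actually requests a GPU, where Ray should set CUDA_VISIBLE_DEVICES itself (show_gpu_env is just an illustrative name):

```python
import os

import ray


@ray.remote(num_gpus=1)
def show_gpu_env():
    # Inside a task scheduled with num_gpus=1, Ray assigns specific GPU ids
    # and exports them to the worker via CUDA_VISIBLE_DEVICES.
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "ray.get_gpu_ids()": ray.get_gpu_ids(),
    }


ray.init(address="auto")
print(ray.get(show_gpu_env.remote()))
```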

I just opened a discussion of my own about a similar issue; here is the link: Cuda Error: invalid device ordinal during training on GCP cluster