CUDA error: invalid device ordinal during training on GCP cluster

This is a high-priority bug that is blocking my work.

I am trying to use Ray Train to run distributed training of a neural network. This is a test, as I am relatively new to Ray, so I am using the Fashion-MNIST example from the Ray documentation. Here is the link to that code if anybody is curious: Train a PyTorch model on Fashion MNIST — Ray 2.35.0
Since this code comes directly from the documentation, I do not expect the error to be in the training code itself. The cluster is set up successfully with ray up, and I am able to run simple remote code on it, including on the worker nodes. This is the simple code I can run successfully:

import ray
import time

ray.init(address = 'auto')

@ray.remote
def isprime(x):
    if x > 1:
        for i in range(2, x):
            if (x % i) == 0:
                return 0
        else:
            return x
    return 0

def main():
    lower = 9000000
    upper = 9010000
    primes = []
    objects = []
    start_time = time.time()

    for num in range(lower, upper + 1):
        x = isprime.remote(num)
        objects.append(x)
    objs = ray.get(objects)

    primes = [x for x in objs if x > 0]
    print(len(primes), primes[0], primes[-1])
    print("Time Elapsed: ", (time.time() - start_time))

if __name__ == "__main__":
    main()

I am not sure exactly what is causing the issue when I run the training code, though. Here is the full output of ray submit ray-cluster-docker.yaml Fashion-MNIST-Ray.py:

2024-09-11 14:43:53,729 INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
Shared connection to 34.83.226.52 closed.
2024-09-11 14:44:05,240 INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
2024-09-11 14:44:13,289 INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 10.138.0.44:6379...
2024-09-11 14:44:13,297 INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
2024-09-11 14:44:13,383 INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `<FrameworkTrainer>(...)`.

View detailed results here: /home/ray/ray_results/TorchTrainer_2024-09-11_14-44-13
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-09-11_14-40-23_824443_2544/artifacts/2024-09-11_14-44-13/TorchTrainer_2024-09-11_14-44-13/driver_artifacts`

Training started with configuration:
╭─────────────────────────────────────────────────╮
│ Training config                                 │
├─────────────────────────────────────────────────┤
│ train_loop_config/batch_size_per_worker      16 │
│ train_loop_config/epochs                     10 │
│ train_loop_config/lr                      0.001 │
╰─────────────────────────────────────────────────╯
(RayTrainWorker pid=2997) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=2961) Started distributed worker processes: 
(TorchTrainer pid=2961) - (ip=10.138.0.44, pid=2997) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=2961) - (ip=10.138.0.44, pid=2998) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=2997) Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
2024-09-11 14:44:24,752 ERROR tune_controller.py:1331 -- Trial task failed for trial TorchTrainer_fc60e_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2630, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=2961, ip=10.138.0.44, actor_id=3a9f05c238880bf01089ab5302000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=2998, ip=10.138.0.44, actor_id=6baeb779a782789b5ab0d43302000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f7fe014ef10>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 189, in train_fn
    with train_func_context():
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/torch/config.py", line 26, in __enter__
    torch.cuda.set_device(device)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Training errored after 0 iterations at 2024-09-11 14:44:24. Total running time: 10s
Error file: /tmp/ray/session_2024-09-11_14-40-23_824443_2544/artifacts/2024-09-11_14-44-13/TorchTrainer_2024-09-11_14-44-13/driver_artifacts/TorchTrainer_fc60e_00000_0_2024-09-11_14-44-13/error.txt
2024-09-11 14:44:24,762 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/ray/ray_results/TorchTrainer_2024-09-11_14-44-13' in 0.0040s.

2024-09-11 14:44:24,765 ERROR tune.py:1037 -- Trials did not complete: [TorchTrainer_fc60e_00000]
ray.exceptions.RayTaskError(RuntimeError): ray::_Inner.train() (pid=2961, ip=10.138.0.44, actor_id=3a9f05c238880bf01089ab5302000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 331, in train
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 53, in check_for_failure
    ray.get(object_ref)
ray.exceptions.RayTaskError(RuntimeError): ray::_RayTrainWorker__execute.get_next() (pid=2998, ip=10.138.0.44, actor_id=6baeb779a782789b5ab0d43302000000, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f7fe014ef10>)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/worker_group.py", line 33, in __execute
    raise skipped from exception_cause(skipped)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/_internal/utils.py", line 189, in train_fn
    with train_func_context():
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/torch/config.py", line 26, in __enter__
    torch.cuda.set_device(device)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py", line 350, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/Fashion-MNIST-Ray.py", line 163, in <module>
    train_fashion_mnist(num_workers=2, use_gpu=True)
  File "/home/ray/Fashion-MNIST-Ray.py", line 158, in train_fashion_mnist
    result = trainer.fit()
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/train/base_trainer.py", line 638, in fit
    raise TrainingFailedError(
ray.train.base_trainer.TrainingFailedError: The Ray Train run failed. Please inspect the previous error messages for a cause. After fixing the issue (assuming that the error is not caused by your own application logic, but rather an error such as OOM), you can restart the run from scratch or continue this run.
To continue this run, you can use: `trainer = TorchTrainer.restore("/home/ray/ray_results/TorchTrainer_2024-09-11_14-44-13")`.
To start a new run that will retry on training failures, set `train.RunConfig(failure_config=train.FailureConfig(max_failures))` in the Trainer's `run_config` with `max_failures > 0`, or `max_failures = -1` for unlimited retries.
Shared connection to 34.83.226.52 closed.
Error: Command failed:

  ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1cecef3852/32eb62159c/%C -o ControlPersist=10s -o ConnectTimeout=120s ret_raiinmaker_com@34.83.226.52 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_nvidia_docker /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/Fashion-MNIST-Ray.py)'"'"'"'"'"'"'"'"''"'"' )'
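
For reference, the trainer setup that the traceback points at is the one from the documentation example. Paraphrased, with the real per-worker training loop replaced by a stub so only the structure is visible, it is essentially this:

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_func_per_worker(config):
    # Stand-in for the per-worker loop from the docs example (model, dataloaders,
    # optimizer omitted here); it only reports a dummy metric.
    ray.train.report({"loss": 0.0})

def train_fashion_mnist(num_workers=2, use_gpu=False):
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config={"lr": 0.001, "epochs": 10, "batch_size_per_worker": 16},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    result = trainer.fit()
    print(result)

if __name__ == "__main__":
    train_fashion_mnist(num_workers=2, use_gpu=True)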

In case the issue is caused by my ray-cluster-docker.yaml, here it is as well so you can take a look:

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 2
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5


# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    #image: "rayproject/ray:latest-gpu"
    image: rayproject/ray-ml:latest-gpu   # use this one if you need ML dependencies, but it's slower to pull
    container_name: "ray_nvidia_docker" # e.g. ray_docker
    
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --gpus all
        - --ulimit nofile=65536:65536
    # worker_image: "rayproject/ray-ml:latest"


# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: raiinmaker-depin

# How Ray will authenticate with newly launched nodes.
auth:
  ssh_user: ret_raiinmaker_com
  ssh_private_key: ~/.ssh/id_rsa

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray_head_gpu:
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 2}
        # Provider-specific config for the head node, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - onHostMaintenance: TERMINATE

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 2, "GPU": 1}
        # Provider-specific config for worker nodes, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-2
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 50
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            # Run workers on preemptible instances by default.
            # Comment this out to use on-demand.
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE

# Specify the node type of the head node (as configured above).
head_node_type: ray_head_gpu

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

initialization_commands: []
    # Wait until nvidia drivers are installed
    # - >-
    #   timeout 300 bash -c "
    #       command -v nvidia-smi && nvidia-smi
    #       until [ \$? -eq 0 ]; do
    #           command -v nvidia-smi && nvidia-smi
    #       done"

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands:
  - sleep 4
  - sudo apt update
  - sudo apt install -y python3-pip python-is-python3
  - pip install ray[default] google-api-python-client==1.8.0 torch torchvision
# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.8.0

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"


# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
    

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Next, I will share some code that I saw elsewhere on this discussion board which I think is relevant:

import os
import pprint
import ray
ray.init()
pprint.pprint(ray.cluster_resources())
pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])

This outputs the following when submitted to the cluster:

2024-09-11 15:12:49,508 INFO util.py:382 -- setting max workers for head node type to 0
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
Shared connection to 34.83.226.52 closed.
2024-09-11 15:12:59,617 INFO util.py:382 -- setting max workers for head node type to 0
Fetched IP: 34.83.226.52
Shared connection to 34.83.226.52 closed.
2024-09-11 15:13:06,505 INFO worker.py:1585 -- Connecting to existing Ray cluster at address: 10.138.0.44:6379...
2024-09-11 15:13:06,513 INFO worker.py:1761 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
{'CPU': 2.0,
 'GPU': 2.0,
 'accelerator_type:T4': 1.0,
 'memory': 4382748672.0,
 'node:10.138.0.44': 1.0,
 'node:__internal_head__': 1.0,
 'object_store_memory': 2191374336.0}
Traceback (most recent call last):
  File "/home/ray/CudaTest.py", line 6, in <module>
    pprint.pprint(os.environ["CUDA_VISIBLE_DEVICES"])
  File "/home/ray/anaconda3/lib/python3.9/os.py", line 679, in __getitem__
    raise KeyError(key) from None
KeyError: 'CUDA_VISIBLE_DEVICES'
Shared connection to 34.83.226.52 closed.
Error: Command failed:

  ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_1cecef3852/32eb62159c/%C -o ControlPersist=10s -o ConnectTimeout=120s ret_raiinmaker_com@34.83.226.52 bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker exec -it  ray_nvidia_docker /bin/bash -c '"'"'bash --login -c -i '"'"'"'"'"'"'"'"'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (python /home/ray/CudaTest.py)'"'"'"'"'"'"'"'"''"'"' )'
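
As far as I understand, the KeyError above is expected in the driver process, since Ray only sets CUDA_VISIBLE_DEVICES inside tasks and actors that request GPUs. A variant that avoids the crash and also checks what a GPU task actually sees (an untested sketch; it assumes torch is importable in the container):

import os
import pprint

import ray
import torch

ray.init(address="auto")
pprint.pprint(ray.cluster_resources())
# May legitimately be unset in the driver process.
print("driver CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))

@ray.remote(num_gpus=1)
def gpu_env():
    # Inside a GPU task, Ray sets CUDA_VISIBLE_DEVICES to the devices it assigned.
    return {
        "ray_gpu_ids": ray.get_gpu_ids(),
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"),
        "torch_device_count": torch.cuda.device_count(),
    }

pprint.pprint(ray.get(gpu_env.remote()))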

Finally, I can confirm that the NVIDIA driver is in fact installed on my head node: when I SSH in and run the (commented-out) initialization command from the YAML above,

timeout 300 bash -c "
    command -v nvidia-smi && nvidia-smi
    until [ \$? -eq 0 ]; do
        command -v nvidia-smi && nvidia-smi
    done"

I get this:

/usr/bin/nvidia-smi
Wed Sep 11 21:31:41 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10              Driver Version: 535.86.10    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8              11W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

If anybody has any information on how to get this training job working on my cluster, please let me know; it would be greatly appreciated.

Here is the info on my Python and library versions (from the head node):
Python = 3.9.19
Torch = 2.0.1
Ray = 2.8.1, although running docker exec -it ray_nvidia_docker ray --version inside the container reports:

2024-09-11 15:24:44,215 - INFO - NumExpr defaulting to 2 threads.
ray, version 2.30.0
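
In case it helps, this is the kind of trivial snippet I can run both on the head node and inside the container (via docker exec -it ray_nvidia_docker python) to compare environments:

import sys

import ray
import torch

# Print the versions that matter for this setup, plus what torch sees for CUDA.
print("python:", sys.version.split()[0])
print("ray:", ray.__version__)
print("torch:", torch.__version__)
print("torch CUDA available:", torch.cuda.is_available())
print("torch CUDA device count:", torch.cuda.device_count())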