Azure GPU machine not getting configured

Hi Team, I am trying to spin up a Ray cluster with the config below, but the node comes up without a GPU even though the VM size points to an NVIDIA A100 machine:

node_config:
  azure_arm_parameters:
    vmSize: Standard_NC24ads_A100_v4
    imagePublisher: canonical
    imageOffer: 0001-com-ubuntu-server-jammy
    imageSku: 22_04-lts-gen2
    imageVersion: latest

Logs:
[2025-03-20T04:34:18.175+0000] {logging_mixin.py:190} INFO - Device Detected: cpu

Hi! At a glance, yes, that VM size should officially support GPUs. Here are some debugging questions:

  1. Do you have the right drivers installed (the NVIDIA driver and CUDA), so the GPUs are recognized correctly?
  2. In your Ray configuration, is your num-gpus variable set correctly to the number of GPUs you have?
  3. Is the CUDA_VISIBLE_DEVICES environment variable set correctly? You can read more about it here: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Technical Blog

Are there any other error messages, or is it just printing that it can only find the CPU?
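On question 3: a quick way to see why a process reports "cpu" is to check how `CUDA_VISIBLE_DEVICES` is being interpreted. The sketch below mirrors the general convention CUDA programs follow (an unset variable means "all GPUs", an empty string or `-1` hides them all); it is an illustration of the convention, not Ray's actual detection code:

```python
import os
import shutil

def visible_gpu_count():
    """Rough sketch of how a process decides whether GPUs are visible.

    Follows the CUDA_VISIBLE_DEVICES convention: unset means "use whatever
    the driver exposes"; an empty string or "-1" hides every device.
    """
    cvd = os.environ.get("CUDA_VISIBLE_DEVICES")
    if cvd is not None:
        # Explicit override: count the device ids that are not blank/-1.
        ids = [d for d in cvd.split(",") if d.strip() not in ("", "-1")]
        return len(ids)
    # No override set: fall back to whether the NVIDIA driver tools are
    # installed at all (if nvidia-smi is missing, so is the driver).
    return 1 if shutil.which("nvidia-smi") else 0
```

If `visible_gpu_count()` comes back 0 on the head node, the next thing to check is whether `nvidia-smi` runs at all, which tells you whether the driver is installed on the VM image.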

I am trying to run a model compilation step, but since no GPU is available it fails. Here is the complete Ray config:

[2025-03-25T20:50:01.705+0000] {python.py:240} INFO - Done. Returned value was:
{'ray_config': {
    'auth': {'ssh_private_key': '***',
             'ssh_public_key': '/opt/airflow/.secrets/ssh-keys/ssh-publickey',
             'ssh_user': 'ubuntu'},
    'available_node_types': {
        'ray.head.default': {
            'max_workers': 12, 'min_workers': 0,
            'node_config': {'azure_arm_parameters': {
                'imageOffer': '0001-com-ubuntu-server-jammy',
                'imagePublisher': 'canonical',
                'imageSku': '22_04-lts-gen2',
                'imageVersion': 'latest',
                'vmSize': 'Standard_NC24ads_A100_v4'}},
            'resources': {'GPU': 1}},
        'ray.worker.default': {
            'max_workers': 12, 'min_workers': 0,
            'node_config': {'azure_arm_parameters': {
                'imageOffer': '0001-com-ubuntu-server-jammy',
                'imagePublisher': 'canonical',
                'imageSku': '22_04-lts-gen2',
                'imageVersion': 'latest',
                'vmSize': 'Standard_NC6s_v3'}},
            'resources': {'GPU': 1}}},
    'cluster_name': 'job-qwsr',
    'cluster_synced_files': [],
    'file_mounts': {'/home/ubuntu/.secrets': '***',
                    '/home/ubuntu/.ssh/id_rsa': '/opt/airflow/.secrets/ssh-keys/ssh-privatekey',
                    '/home/ubuntu/.ssh/id_rsa.pub': '/opt/airflow/.secrets/ssh-keys/ssh-publickey'},
    'file_mounts_sync_continuously': False,
    'head_node_type': 'ray.head.default',
    'head_setup_commands': [
        'pip install --upgrade pip',
        'pip install azure-cli azure-identity',
        'chmod 600 /home/ubuntu/.secrets/ssh-privatekey',
        'rm ~/.ssh/id_rsa',
        'cp /home/ubuntu/.secrets/ssh-privatekey ~/.ssh/id_rsa',
        'chmod 600 /home/ubuntu/.secrets/ssh-publickey',
        'rm ~/.ssh/id_rsa.pub',
        'cp /home/ubuntu/.secrets/ssh-publickey ~/.ssh/id_rsa.pub',
        'chmod 600 /home/ubuntu/.ssh/*',
        "git config --global core.sshCommand 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'",
        'git clone git@github.com:my-team/compiler.git',
        "echo 'alias python=python3' >> ~/.bashrc && source ~/.bashrc",
        'pip3 install poetry',
        'cd compiler && poetry env use system && rm poetry.lock && poetry lock && poetry install',
        'pip install pyopenssl --upgrade',
        "python -m pip install -U 'ray[default]'"],
    'head_start_ray_commands': [
        'ray stop',
        'ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml'],
    'idle_timeout_minutes': 5,
    'initialization_commands': [],
    'max_workers': 7,
    'provider': {'location': 'centralus',
                 'resource_group': 'raycluster-job-0a13fbb4ea8a499b95e3ed0ca5391482',
                 'subscription_id': '<sub-id>',
                 'type': 'azure'},
    'rsync_exclude': ['**/.git', '**/.git/**'],
    'rsync_filter': ['.gitignore'],
    'setup_commands': ['sudo apt update && sudo apt install -y python3-pip'],
    'upscaling_speed': 1.0,
    'worker_setup_commands': [],
    'worker_start_ray_commands': [
        'ray stop',
        'ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076']},
 'machine_config': None}
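One observation on the config above: the setup and head-setup commands only install python3-pip and Python packages, and the stock Canonical Jammy marketplace image ships without NVIDIA drivers, so `nvidia-smi` (and any GPU detection built on it) would fail on these VMs. A hedged sketch of a driver-install step that could be added (the `ubuntu-drivers` tool is Ubuntu's stock mechanism, but the exact steps are an assumption; verify against Azure's N-series GPU driver documentation):

```yaml
# Sketch only: install the NVIDIA driver on the stock Ubuntu 22.04 image
# before Ray starts. Exact package selection is an assumption.
setup_commands:
  - sudo apt update && sudo apt install -y python3-pip ubuntu-drivers-common
  - sudo ubuntu-drivers install   # picks a matching nvidia-driver-* package
  - nvidia-smi                    # fails fast if the driver did not load
```

Azure also offers an NVIDIA GPU driver VM extension as an alternative to installing drivers in setup commands, which may be worth checking.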

Is there something I am missing with respect to Azure? A similar setup on AWS works fine.

Any leads would be helpful.