Creating a cluster on Azure: worker nodes not available when resources are requested

I am following the guide to create a cluster on Azure from a YAML file, but the worker nodes never come up, so I cannot use them.

I am requesting the Standard_NC4as_T4_v3 size (4 cores, 28 GB RAM) on Azure and I have enough quota, but I always get the same error:

(autoscaler +43s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
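
For context, this is the quick check I can run from the head node (via ray attach) right after ray up to see what the cluster is advertising; these are plain Ray API calls, nothing from the guide:

# Quick check from the head node: total and currently free resources
# known to the autoscaler.
import ray

ray.init(address="auto")
print(ray.cluster_resources())    # totals the autoscaler knows about
print(ray.available_resources())  # what is currently free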

This is my YAML configuration file:

cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 3

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: uksouth
    resource_group: rg-cluster-ray
    # set subscription id otherwise the default from az cli will be used
    subscription_id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 4, "GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3 # 4 cores, 28 GB RAM, 176 GB, 0.62 USD/h
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 3
        # The resources provided by this node type.
        resources: {"CPU": 4, "GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3 # 4 cores, 28 GB RAM, 176 GB, 0.62 USD/h
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                # priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: []
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

I have only changed these lines:

  • resources: {"CPU": 4, "GPU": 1} (verified with the check just below this list)
  • vmSize: Standard_NC4as_T4_v3
  • I have kept # priority: Spot commented out to make sure Spot instances are not used.
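
To verify that the resources change is actually picked up, I also list what each live node advertises (again just standard Ray calls run from the head node, shown here for illustration):

# Each node type is configured with {"CPU": 4, "GPU": 1}; this prints what
# every live node actually advertises to the scheduler.
import ray

ray.init(address="auto")
for node in ray.nodes():
    if node["Alive"]:
        print(node["NodeManagerAddress"], node["Resources"])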

I have tested it with this experiment:

# -*- coding: utf-8 -*-

from typing import Dict
import pandas as pd
import numpy as np
from transformers import pipeline
from PIL import Image
import datetime as dt
from time import sleep
import ray

ini = dt.datetime.now()

ray.init(address='auto')

# Define the actor class
class ImageClassifier:
    def __init__(self):
        # If doing CPU inference, set `device="cpu"` instead.
        self.classifier = pipeline("image-classification", model="google/vit-base-patch16-224")#, device="cuda:0")

    def __call__(self, batch: Dict[str, np.ndarray]):
        # Convert the numpy array of images into a list of PIL images which is the format the HF pipeline expects.
        outputs = self.classifier(
            [Image.fromarray(image_array) for image_array in batch["image"]], 
            top_k=1, 
            batch_size=BATCH_SIZE)
        
        # `outputs` is a list of length-one lists. For example:
        # [[{'score': '...', 'label': '...'}], ..., [{'score': '...', 'label': '...'}]]
        batch["score"] = [output[0]["score"] for output in outputs]
        batch["label"] = [output[0]["label"] for output in outputs]
        return batch

# Download a public example dataset
s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"
ds = ray.data.read_images(
    s3_uri, mode="RGB"
)

# Define the batch size and concurrency
# Pick the largest batch size that can fit on our GPUs.
# If doing CPU inference you might need to lower considerably (e.g. to 10).
BATCH_SIZE = 10
resultado = [] 
imagenes = 40
gpus = 2
actors = 2

# (5) Map the model over the data for parallel inference
predictions = ds.map_batches(
    ImageClassifier,
    concurrency=actors, # Number of model replicas (actors). Change this based on the number of GPUs in the cluster.
    num_gpus=gpus,  # GPUs requested per model replica. If doing CPU inference, set to 0.
    batch_size=BATCH_SIZE # Use batch size from above.
)

ini2 = dt.datetime.now()
prediction_batch = predictions.take_batch(imagenes)
fin2 = dt.datetime.now()
tiempo_ejecucion_inferencia = (fin2-ini2).total_seconds()

# Cluster information
print(ray.cluster_resources())
print("*"*100)
print(ray.available_resources())
print("*"*100)
print(f" - Inference over {imagenes} images parallelized across {gpus} GPUs: execution time was {tiempo_ejecucion_inferencia} seconds")

resultado.append({"num_gpu":gpus, "num_imagenes":imagenes, 'tiempo_ejecucion':tiempo_ejecucion_inferencia})
sleep(5)

df = pd.DataFrame(resultado).pivot_table(values="tiempo_ejecucion",index="num_imagenes",columns="num_gpu")
print(df.head(10))

fin = dt.datetime.now()
print("Total experiment duration: ", fin-ini)

I never see two active GPUs, and I get this error:

(autoscaler +43s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
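
In case it helps, the smallest request I can think of that matches the numbers in that message is a single task asking for 2 GPUs plus the 1 CPU that Ray requests for a task by default; this probe is illustrative only and is not part of the experiment above:

# Illustrative probe (not in my original script): a task requesting 2 GPUs,
# which together with the default 1 CPU matches {'CPU': 1.0, 'GPU': 2.0}.
import ray

ray.init(address="auto")

@ray.remote(num_gpus=2)
def two_gpu_probe():
    return ray.get_gpu_ids()

# If no single node type can provide 2 GPUs, I expect this to raise a
# GetTimeoutError while the autoscaler logs the same error as above.
print(ray.get(two_gpu_probe.remote(), timeout=120))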

This is the log from creating the cluster:

(ray) eduardo@erl:~/Documentos/ray_cluster$ ray up config/my_config_cluster.yaml
Cluster: default

2024-04-04 10:55:13,764 INFO util.py:382 -- setting max workers for head node type to 0
Checking Azure environment settings
2024-04-04 10:55:14,236 INFO config.py:52 -- Using subscription id: xxxxxxxxxxxxxxxxxxxxxxxxxxx
2024-04-04 10:55:14,236 INFO config.py:67 -- Creating/Updating resource group: rg-cluster-ray
2024-04-04 10:55:14,941 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:55:15,992 INFO config.py:79 -- Using cluster name: default
2024-04-04 10:55:15,992 INFO config.py:90 -- Using unique id: d932
2024-04-04 10:55:15,993 INFO config.py:98 -- Using subnet mask: 10.112.0.0/16
2024-04-04 10:55:47,351 - INFO - No environment configuration found.
2024-04-04 10:55:47,367 - INFO - ManagedIdentityCredential will use IMDS
2024-04-04 10:55:49,323 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
No head node found. Launching a new cluster. Confirm [y/N]: y

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Acquiring an up-to-date head node
2024-04-04 10:55:54,636 INFO node_provider.py:195 -- Reusing nodes []. To disable reuse, set `cache_stopped_nodes: False` under `provider` in the cluster configuration.
2024-04-04 10:55:55,408 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:55:55,408 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
  Launched a new head node
  Fetching the new head node
2024-04-04 10:57:28,795 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:57:28,795 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
  
<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 172.166.122.51
ssh: connect to host 172.166.122.51 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

 08:58:29 up 1 min,  1 user,  load average: 3.30, 1.00, 0.35
Shared connection to 172.166.122.51 closed.
    Success.
  Updating cluster configuration. [hash=1e0c71e9f1af416f0d01fff2f51eb4559104fe93]
  New status: syncing-files
  [2/7] Processing file mounts
Shared connection to 172.166.122.51 closed.
    ~/.ssh/id_rsa.pub from /home/eduardo/.ssh/id_rsa.pub
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] Running initialization commands
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Connection to 172.166.122.51 closed.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Connection to 172.166.122.51 closed.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Connection to 172.166.122.51 closed.
  [5/7] Initializing command runner
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Shared connection to 172.166.122.51 closed.
latest-gpu: Pulling from rayproject/ray-ml
7a2c55901189: Pull complete 
09d415c238d7: Pull complete 
9fe6e2e61518: Pull complete 
41f16248e682: Pull complete 
95d7b7817039: Pull complete 
8f6c90485347: Pull complete 
ab17245097e4: Pull complete 
dfecd7e9912b: Pull complete 
464a8f745445: Pull complete 
d67c111fa588: Pull complete 
0723abde19c0: Pull complete 
b3b26ce36551: Pull complete 
71ac5783bd6d: Pull complete 
a63f6b91b260: Pull complete 
4f4fb700ef54: Pull complete 
55dcf61a559c: Pull complete 
75eb95dfd429: Pull complete 
1681c00ae96f: Pull complete 
ef15eba5a8e6: Pull complete 
82db9a4b8c3c: Pull complete 
c1bea788c61b: Pull complete 
ac25f858e400: Pull complete 
81b3a73e5971: Pull complete 
235971a4d5ed: Pull complete 
b78d725e5848: Pull complete 
Digest: sha256:052e1e8c1d7c16effaba16b7bb1a630f9429acd7124faea9992c99ce83d516e7
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Thu Apr  4 09:12:39 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000001:00:00.0 Off |                  Off |
| N/A   32C    P8     9W /  70W |      0MiB / 16127MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Shared connection to 172.166.122.51 closed.
9b9defc834c056ce7b573637b51012e7447027cc0ea675ca62939c4d807cd719
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
sending incremental file list
ray_bootstrap_config.yaml

sent 1,712 bytes  received 35 bytes  3,494.00 bytes/sec
total size is 3,955  speedup is 2.26
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
sending incremental file list
ray_bootstrap_key.pem

sent 2,222 bytes  received 35 bytes  1,504.67 bytes/sec
total size is 3,381  speedup is 1.50
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
  [6/7] No setup commands to run.
  [7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 172.166.122.51 closed.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 10.112.0.4

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.112.0.4:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    127.0.0.1:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 172.166.122.51 closed.
  New status: up-to-date

Useful commands:
  To terminate the cluster:
    ray down /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
  
  To retrieve the IP address of the cluster head:
    ray get-head-ip /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
  
  To port-forward the cluster's Ray Dashboard to the local machine:
    ray dashboard /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
  
  To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
    ray job submit --address http://localhost:<dashboard-port> --working-dir . -- python my_script.py
  
  To connect to a terminal on the cluster head for debugging:
    ray attach /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
  
  To monitor autoscaling:
    ray exec /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'

When I look at the resources created in the Azure portal, I only see resources associated with the head node and nothing related to the workers.

Thanks in advance.