I am following the guide to create a cluster from a YAML file on Azure, but I cannot get the workers to come up or be used.
I am requesting Standard_NC4as_T4_v3 (4 cores, 28 GB RAM)
on Azure and I have enough quota, but I always get the same error:
(autoscaler +43s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
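As far as I understand, this request means the autoscaler needs a single node that offers 1 CPU and 2 GPUs at the same time; it cannot combine two one-GPU nodes. A toy example of the kind of declaration that produces a bundle with this shape (hypothetical, not my actual code):

import ray

# A task declaring num_cpus=1 and num_gpus=2 creates exactly this kind of
# resource bundle; as far as I know it must fit on ONE node and cannot be
# split across two nodes that have one GPU each.
@ray.remote(num_cpus=1, num_gpus=2)
def needs_two_gpus():
    return "scheduled"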
This is my YAML file configuration:
cluster_name: default
# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 3
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
# image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options: # Extra options to pass into "docker run"
- --ulimit nofile=65536:65536
# Example of running a GPU head with CPU workers
# head_image: "rayproject/ray-ml:latest-gpu"
# Allow Ray to automatically detect GPUs
# worker_image: "rayproject/ray-ml:latest-cpu"
# worker_run_options: []
# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5
# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: uksouth
    resource_group: rg-cluster-ray
    # set subscription id otherwise the default from az cli will be used
    subscription_id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub
# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below
# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 4, "GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3 # 4 cores, 28 GB RAM, 176 GB, 0.62 USD/h
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 3
        # The resources provided by this node type.
        resources: {"CPU": 4, "GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC4as_T4_v3 # 4 cores, 28 GB RAM, 176 GB, 0.62 USD/h
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-1804
                imageSku: 1804-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                # priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
"~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
# enable docker setup
- sudo usermod -aG docker $USER || true
- sleep 10 # delay to avoid docker permission denied errors
# get rid of annoying Ubuntu message
- touch ~/.sudo_as_admin_successful
# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: []
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: []
# - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
I have only changed these lines:
resources: {"CPU": 4, "GPU": 1}
vmSize: Standard_NC4as_T4_v3
and I have left
# priority: Spot
commented out to make sure I don't use Spot instances.
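My understanding is that the `resources` entry is what each node type advertises to the autoscaler, so with this config every node offers at most one GPU. A small sanity check over the YAML (just a sketch, assuming PyYAML is installed and using the config path shown in the log below):

import yaml

# Print the resources each node type advertises to the autoscaler,
# taken directly from the cluster YAML.
with open("config/my_config_cluster.yaml") as f:
    cfg = yaml.safe_load(f)
for name, node_type in cfg["available_node_types"].items():
    print(name, node_type.get("resources"))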
I have tested it with this experiment:
# -*- coding: utf-8 -*-
from typing import Dict
import pandas as pd
import numpy as np
from transformers import pipeline
from PIL import Image
import datetime as dt
from time import sleep
import ray
ini = dt.datetime.now()
ray.init(address='auto')
# Define the actor class
class ImageClassifier:
    def __init__(self):
        # If doing CPU inference, set `device="cpu"` instead.
        self.classifier = pipeline("image-classification", model="google/vit-base-patch16-224")  # , device="cuda:0")

    def __call__(self, batch: Dict[str, np.ndarray]):
        # Convert the numpy array of images into a list of PIL images, which is the format the HF pipeline expects.
        outputs = self.classifier(
            [Image.fromarray(image_array) for image_array in batch["image"]],
            top_k=1,
            batch_size=BATCH_SIZE)
        # `outputs` is a list of length-one lists. For example:
        # [[{'score': '...', 'label': '...'}], ..., [{'score': '...', 'label': '...'}]]
        batch["score"] = [output[0]["score"] for output in outputs]
        batch["label"] = [output[0]["label"] for output in outputs]
        return batch
# Download a public example dataset
s3_uri = "s3://anonymous@air-example-data-2/imagenette2/val/"
ds = ray.data.read_images(
s3_uri, mode="RGB"
)
# Define the batch size and the concurrency
# Pick the largest batch size that can fit on our GPUs.
# If doing CPU inference you might need to lower considerably (e.g. to 10).
BATCH_SIZE = 10
resultado = []
imagenes = 40
gpus = 2
actors = 2
# (5) Map the model over the data for parallel inference
predictions = ds.map_batches(
    ImageClassifier,
    concurrency=actors,  # Number of model replicas. Change this based on the number of GPUs in your cluster.
    num_gpus=gpus,  # GPUs requested per model replica. If doing CPU inference, set to 0.
    batch_size=BATCH_SIZE  # Use batch size from above.
)
ini2 = dt.datetime.now()
prediction_batch = predictions.take_batch(imagenes)
fin2 = dt.datetime.now()
tiempo_ejecucion_inferencia = (fin2-ini2).total_seconds()
# Cluster information
print(ray.cluster_resources())
print("*"*100)
print(ray.available_resources())
print("*"*100)
print(f" - Inferencia de: {imagenes} paralelizadas en {gpus} GPUs el tiempo de ejecucion es: {tiempo_ejecucion_inferencia} segundos")
resultado.append({"num_gpu":gpus, "num_imagenes":imagenes, 'tiempo_ejecucion':tiempo_ejecucion_inferencia})
sleep(5)
df = pd.DataFrame(resultado).pivot_table(values="tiempo_ejecucion",index="num_imagenes",columns="num_gpu")
print(df.head(10))
fin = dt.datetime.now()
print("Duracion total del experimento: ", fin-ini)
I never see two active GPUs and get this error:
(autoscaler +43s) Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 2.0}. Add suitable node types to this cluster to resolve this issue.
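To see what each node actually reports while the job runs, I can query the cluster from the driver (a minimal sketch; `ray status` on the head node should show the same pending demand):

import ray

ray.init(address="auto")

# Each entry describes one node and the resources it advertises,
# e.g. {'CPU': 4.0, 'GPU': 1.0, ...} for a Standard_NC4as_T4_v3.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"])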
This is the log from bringing up the cluster:
(ray) eduardo@erl:~/Documentos/ray_cluster$ ray up config/my_config_cluster.yaml
Cluster: default
2024-04-04 10:55:13,764 INFO util.py:382 -- setting max workers for head node type to 0
Checking Azure environment settings
2024-04-04 10:55:14,236 INFO config.py:52 -- Using subscription id: xxxxxxxxxxxxxxxxxxxxxxxxxxx
2024-04-04 10:55:14,236 INFO config.py:67 -- Creating/Updating resource group: rg-cluster-ray
2024-04-04 10:55:14,941 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:55:15,992 INFO config.py:79 -- Using cluster name: default
2024-04-04 10:55:15,992 INFO config.py:90 -- Using unique id: d932
2024-04-04 10:55:15,993 INFO config.py:98 -- Using subnet mask: 10.112.0.0/16
2024-04-04 10:55:47,351 - INFO - No environment configuration found.
2024-04-04 10:55:47,367 - INFO - ManagedIdentityCredential will use IMDS
2024-04-04 10:55:49,323 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
No head node found. Launching a new cluster. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
2024-04-04 10:55:54,636 INFO node_provider.py:195 -- Reusing nodes []. To disable reuse, set `cache_stopped_nodes: False` under `provider` in the cluster configuration.
2024-04-04 10:55:55,408 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:55:55,408 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
Launched a new head node
Fetching the new head node
2024-04-04 10:57:28,795 - INFO - AzureCliCredential.get_token succeeded
2024-04-04 10:57:28,795 - INFO - DefaultAzureCredential acquired a token from AzureCliCredential
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 172.166.122.51
ssh: connect to host 172.166.122.51 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host 172.166.122.51 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
08:58:29 up 1 min, 1 user, load average: 3.30, 1.00, 0.35
Shared connection to 172.166.122.51 closed.
Success.
Updating cluster configuration. [hash=1e0c71e9f1af416f0d01fff2f51eb4559104fe93]
New status: syncing-files
[2/7] Processing file mounts
Shared connection to 172.166.122.51 closed.
~/.ssh/id_rsa.pub from /home/eduardo/.ssh/id_rsa.pub
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
[3/7] No worker file mounts to sync
New status: setting-up
[4/7] Running initialization commands
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Connection to 172.166.122.51 closed.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Connection to 172.166.122.51 closed.
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Connection to 172.166.122.51 closed.
[5/7] Initializing command runner
Warning: Permanently added '172.166.122.51' (ED25519) to the list of known hosts.
Shared connection to 172.166.122.51 closed.
latest-gpu: Pulling from rayproject/ray-ml
7a2c55901189: Pull complete
09d415c238d7: Pull complete
9fe6e2e61518: Pull complete
41f16248e682: Pull complete
95d7b7817039: Pull complete
8f6c90485347: Pull complete
ab17245097e4: Pull complete
dfecd7e9912b: Pull complete
464a8f745445: Pull complete
d67c111fa588: Pull complete
0723abde19c0: Pull complete
b3b26ce36551: Pull complete
71ac5783bd6d: Pull complete
a63f6b91b260: Pull complete
4f4fb700ef54: Pull complete
55dcf61a559c: Pull complete
75eb95dfd429: Pull complete
1681c00ae96f: Pull complete
ef15eba5a8e6: Pull complete
82db9a4b8c3c: Pull complete
c1bea788c61b: Pull complete
ac25f858e400: Pull complete
81b3a73e5971: Pull complete
235971a4d5ed: Pull complete
b78d725e5848: Pull complete
Digest: sha256:052e1e8c1d7c16effaba16b7bb1a630f9429acd7124faea9992c99ce83d516e7
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
Thu Apr 4 09:12:39 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000001:00:00.0 Off | Off |
| N/A 32C P8 9W / 70W | 0MiB / 16127MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Shared connection to 172.166.122.51 closed.
9b9defc834c056ce7b573637b51012e7447027cc0ea675ca62939c4d807cd719
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
sending incremental file list
ray_bootstrap_config.yaml
sent 1,712 bytes received 35 bytes 3,494.00 bytes/sec
total size is 3,955 speedup is 2.26
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
sending incremental file list
ray_bootstrap_key.pem
sent 2,222 bytes received 35 bytes 1,504.67 bytes/sec
total size is 3,381 speedup is 1.50
Shared connection to 172.166.122.51 closed.
Shared connection to 172.166.122.51 closed.
[6/7] No setup commands to run.
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to 172.166.122.51 closed.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 10.112.0.4
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.112.0.4:6379'
To connect to this Ray cluster:
import ray
ray.init()
To submit a Ray job using the Ray Jobs CLI:
RAY_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
127.0.0.1:8265
If connection to the dashboard fails, check your firewall settings and network configuration.
Shared connection to 172.166.122.51 closed.
New status: up-to-date
Useful commands:
To terminate the cluster:
ray down /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
To retrieve the IP address of the cluster head:
ray get-head-ip /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
To port-forward the cluster's Ray Dashboard to the local machine:
ray dashboard /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
ray job submit --address http://localhost:<dashboard-port> --working-dir . -- python my_script.py
To connect to a terminal on the cluster head for debugging:
ray attach /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml
To monitor autoscaling:
ray exec /home/eduardo/Documentos/ray_cluster/config/my_config_cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Also, when I look at the resources created in the Azure portal, I only see resources associated with the head node and nothing related to the workers.
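In case it helps, this is roughly how the resource group can be inspected programmatically instead of through the portal (a sketch using the Azure SDK, assuming azure-identity and azure-mgmt-compute are installed; the subscription id is masked like in the YAML):

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# List the VMs that exist in the cluster's resource group
# (in the portal I only see head-node resources).
subscription_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # masked, same id as in the YAML
client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)
for vm in client.virtual_machines.list("rg-cluster-ray"):
    print(vm.name, vm.hardware_profile.vm_size)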
Thanks in advance.