Ray starts head node successfully but no workers (Azure)

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

I am running a cluster on Microsoft Azure using this YAML file:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 4

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty object means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: azure
    # https://azure.microsoft.com/en-us/global-infrastructure/locations
    location: eastasia
    resource_group: ray-cluster
    # set subscription id otherwise the default from az cli will be used
    # subscription_id: 00000000-0000-0000-0000-000000000000

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        resources: {"CPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D11_v2
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: Canonical  
                imageOffer: 0001-com-ubuntu-server-focal
                imageSku: 20_04-lts
                imageVersion: latest


    #ray.head.default:
        ## The resources provided by this node type.
        #resources: {"CPU": 2}
        ## Provider-specific config, e.g. instance type.
        #node_config:
            #azure_arm_parameters:
                #vmSize: Standard_D2s_v3
                ## List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                #imagePublisher: Canonical 
                #imageOffer: 0001-com-ubuntu-server-focal
                #imageSku: 20_4-lts-gen2
                #imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC12s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: Canonical
                imageOffer: 0001-com-ubuntu-server-focal
                imageSku: 20_04-lts-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                #priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1
    ray.worker.default1:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 1
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"GPU": 1}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_NC6s_v3
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: Canonical
                imageOffer: 0001-com-ubuntu-server-focal
                imageSku: 20_4-lts-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                #priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    - curl -fsSL https://get.docker.com -o get-docker.sh
    - sudo sh get-docker.sh
    - sudo usermod -aG docker $USER
    - sudo systemctl restart docker -f
    # enable docker setup
    - sudo usermod -aG docker $USER || true
    - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful
    #- sudo apt update
    #- sudo apt upgrade -y
    #- sudo apt -y install python3-pip
    #- pip install protobuf==3.20.*
    #- pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
    #- pip install pandas
    #- pip install wandb
    #- pip install hyperopt 
    #- sleep 10

# List of shell commands to run to set up nodes.
# NOTE: rayproject/ray-ml:latest has ray latest bundled
setup_commands: [
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl"
     sudo apt update,
     sudo apt upgrade -y,
     sudo apt -y install python3-pip,
     pip install protobuf==3.20.*,
     pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113,
     pip install pandas,
     pip install wandb,
     pip install hyperopt, 
     sleep 10]

# Custom commands that will be run on the head node after common setup.
# NOTE: rayproject/ray-ml:latest has azure packages bundled
head_setup_commands: [conda install -c conda-forge typing_extensions -y]
    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
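
For completeness, I bring the cluster up with the standard launcher workflow, roughly like this (the file name is just what I use locally):

# launch/update the cluster from my local machine
ray up azure-cluster.yaml -y
# open a shell on the head node
ray attach azure-cluster.yaml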

However, the cluster consists of only the head node and no worker nodes. Running ray status gives me this:

(base) ray@ray-default-head-3237b1190:~$ ray status --address 10.221.0.6:62345
======== Autoscaler status: 2022-06-28 01:33:00.726622 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_91a518bfa2356bc6fd9082ced8b2b19e8828f1b4ab3c0311ec979554
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 2.0/2.0 CPU (2.0 used of 2.0 reserved in placement groups)
 0.00/7.289 GiB memory
 0.00/3.644 GiB object_store_memory

Demands:
 {'CPU': 1.0} * 1 (PACK): 1+ pending placement groups
(base) ray@ray-default-head-3237b1190:~$ ray status --address 10.221.0.6:6379
======== Autoscaler status: 2022-06-28 01:39:02.525790 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray.head.default
Pending:
 ray.worker.default, 1 launching
 ray.worker.default1, 1 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 CPU
 0.00/7.776 GiB memory
 0.00/3.888 GiB object_store_memory

Demands:
 (no resource demands)

As you can see, there are no errors, so I'm unsure what to do now.

Looks like the worker nodes were still pending. How long have they been stuck in the pending state?
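
If they stay stuck, the autoscaler's monitor log on the head node usually shows why a launch attempt is failing; something along these lines should surface any provider-side errors (paths assume a default Ray session, and the config file name is a placeholder):

# from your local machine, via the cluster launcher
ray exec <your-cluster-config>.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.*'
# or directly on the head node
tail -n 200 /tmp/ray/session_latest/logs/monitor.*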

Thanks for your response. They're still pending even now, but I think I know the issue. I have a student account on Azure, and the Standard_NC and Standard_NV sizes are not available to me. I found out by executing this command:

az vm list-skus --output table --size=Standard_N -l eastasia
There I found that all of the VM sizes available to me had a quota of 0.
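
For anyone else hitting this, the remaining per-family vCPU quota can also be checked directly; the size below is only an example:

# remaining vCPU quota per VM family in the region
az vm list-usage --location eastasia --output table
# restrictions/availability for a specific size
az vm list-skus --location eastasia --size Standard_NC6s_v3 --output table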

So it's a subscription quota issue, not an autoscaler issue.
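
As a workaround (untested on my end), pointing the worker node types at a VM family the subscription does have quota for should let the autoscaler launch them, e.g. something like:

    ray.worker.default:
        min_workers: 1
        max_workers: 2
        resources: {"CPU": 2}
        node_config:
            azure_arm_parameters:
                vmSize: Standard_D2s_v3  # any size with non-zero quota on the subscription
                imagePublisher: Canonical
                imageOffer: 0001-com-ubuntu-server-focal
                imageSku: 20_04-lts-gen2
                imageVersion: latest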