Cannot launch more than 5 nodes

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

When launching a Ray cluster on Azure, I can bring up 5 worker nodes. However, when I try to launch additional nodes, the agent appears to be killed and the raylet attempts to restart forever: the nodes show up as DEAD and keep trying to re-install and restart.
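For reference, the DEAD state above can be observed with the standard autoscaler tooling, e.g. (./cluster.yaml is just a placeholder for my local config file):

# Stream the autoscaler/node status from my machine:
ray monitor ./cluster.yaml
# Or query the current nodes and resources on the head node:
ray exec ./cluster.yaml 'ray status'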

I can see the following in the log of a dead node (oldest entries first):

[2023-04-03 08:55:27,374 I 82657 82657] (raylet) main.cc:300: Raylet received SIGTERM, shutting down...
[2023-04-03 08:55:27,374 I 82657 82657] (raylet) accessor.cc:435: Unregistering node info, node id = 56c4bba9454ab665f05a2fc39350e189765818b0a7b7d9006a5985ca
[2023-04-03 08:55:27,375 D 82657 82657] (raylet) periodical_runner.cc:26: PeriodicalRunner is destructed
[2023-04-03 08:55:27,375 I 82657 82657] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
[2023-04-03 08:55:27,396 I 82657 82855] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip 10.68.0.6. id 424238335
[2023-04-03 08:55:27,396 E 82657 82855] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2023-04-03 08:55:27,430 D 82657 82657] (raylet) periodical_runner.cc:26: PeriodicalRunner is destructed
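For completeness, the checks suggested by that error message come down to the following (the worker IP is a placeholder; the agent log path is copied from the message above and may sit under /tmp/ray/session_latest/logs/ on some Ray versions):

# SSH into one of the dead workers using the key from the cluster yaml:
ssh -i ~/.ssh/id_rsa ubuntu@<worker-ip>
# Read the dashboard agent log referenced by the raylet error:
cat /tmp/ray/session_latest/dashboard_agent.log
# Check the installed grpcio version against Ray's requirement:
pip freeze | grep grpcio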

Below is my cluster yaml:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# The maximum number of worker nodes to launch in addition to the head
# node.
max_workers: 10

upscaling_speed: 1.0


# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20

# Cloud-provider specific configuration.
provider:
    type: azure
    location: westeurope
    resource_group: oren-rg
    cache_stopped_nodes: False

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
    # you must specify paths to matching private and public key pair files
    # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair
    ssh_private_key: ~/.ssh/id_rsa
    # changes to this should match what is specified in file_mounts
    ssh_public_key: ~/.ssh/id_rsa.pub

# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file
# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines
# Changes to the local file will be used during deployment of the head node; however, worker node deployment occurs
# on the head node, so changes to the template must be included in the wheel file used in the setup_commands section below

# Tell the autoscaler the allowed node types and the resources they provide.
# The key is the name of the node type, which is just for debugging purposes.
# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head.default:
        # The resources provided by this node type.
        # resources: {"CPU": 0, "memory": 16}
        resources: {"CPU": 0, "memory": 16}
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                vmSize: Standard_B4ms
                vmTags: ['poc']
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-2004
                imageSku: 2004-gen2
                imageVersion: latest

    ray.worker.default:
        # The minimum number of worker nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 7
        # The maximum number of worker nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 7
        # The resources provided by this node type.
        resources:
            CPU: 60
            memory: 450
        # Provider-specific config, e.g. instance type.
        node_config:
            azure_arm_parameters:
                # vmSize: Standard_D8as_v5 #D4_v5
                # vmSize: Standard_F32s_v2
                # vmSize: Standard_D32as_v5
                vmSize: Standard_HB120rs_v2
                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
                imagePublisher: microsoft-dsvm
                imageOffer: ubuntu-2004
                imageSku: 2004-gen2
                imageVersion: latest
                # optionally set priority to use Spot instances
                priority: Spot
                # set a maximum price for spot instances if desired
                # billingProfile:
                #     maxPrice: -1

# Specify the node type of the head node (as configured above).
head_node_type: ray.head.default

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
     "~/.ssh/id_rsa.pub": "~/.ssh/id_rsa.pub"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands:
    # enable docker setup
    # - sudo usermod -aG docker $USER || true
    # - sleep 10  # delay to avoid docker permission denied errors
    # get rid of annoying Ubuntu message
    - touch ~/.sudo_as_admin_successful

# List of shell commands to run to set up nodes.
setup_commands:
    - which ray || pip3 install -U "ray[default]"

# # Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip3 install -U azure-cli-core==2.29.1 azure-identity==1.7.0 azure-mgmt-compute==23.1.0 azure-mgmt-network==19.0.0 azure-mgmt-resource==20.0.0 msrestazure==0.6.4

# # Custom commands that will be run on worker nodes after common setup.
# worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - export RAY_BACKEND_LOG_LEVEL=debug
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - export RAY_BACKEND_LOG_LEVEL=debug
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

I have tried destroying and recreating the cluster but I get the same behavior. I was able to create a larger cluster with smaller machines (fewer CPUs), although I am not sure whether this is directly related (perhaps some limit at 300 CPUs?).

Any advice or additional logs to look at?

Bump. Any advice or further logs to help isolate the issue?

I had a similar issue on AWS, and the problem was that my AWS account had a limit on the number of vCPUs that could be used at the same time. The solution was to write to AWS Support and request a quota increase. Maybe that's your case as well?

@zalmane Thanks to @jamil for pointing out the resource limitation capped by your account. You might want to verify your account's limit on the number of CPUs or instances you can launch.
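If it helps, one quick way to compare current usage against the regional quota (assuming the Azure CLI is installed and you are logged in; westeurope is the location from your YAML) is something like:

# Per-family vCPU usage vs. limits in the cluster's region:
az vm list-usage --location westeurope --output table
# Narrow the output to the vCPU rows (includes the HB-series family used by your workers):
az vm list-usage --location westeurope --output table | grep -i vcpu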

Let us know if that resolves your problem.

HTH and cheers,
Jules