Ray cluster worker nodes stuck at uninitialized

The Ray cluster set up by ray up cluster.yaml gets stuck most of the time while starting the worker nodes, with only the head node coming up. Occasionally it does start with all nodes running, when the cluster is restarted either with ray down followed by ray up, or with ray up cluster.yaml --restart-only.
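For reference, these are the invocations in question (cluster.yaml is the config shared further down in this thread):

ray up cluster.yaml                              # usually hangs: head starts, workers stay pending
ray down cluster.yaml && ray up cluster.yaml     # occasionally brings all nodes up
ray up cluster.yaml --restart-only               # occasionally brings all nodes up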

Below are the monitor logs (monitor.err / monitor.log / monitor.out):

Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
2023-05-05 18:16:16,262 INFO node_provider.py:54 -- ClusterState: Loaded cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
Fetched IP: 172.16.30.130
Warning: Permanently added '172.16.30.130' (ECDSA) to the list of known hosts.
==> /tmp/ray/session_latest/logs/monitor.err <==

==> /tmp/ray/session_latest/logs/monitor.log <==
2023-05-05 18:15:10,981 INFO monitor.py:651 -- Starting monitor using ray installation: /usr/local/lib/python3.8/dist-packages/ray/__init__.py
2023-05-05 18:15:10,981 INFO monitor.py:652 -- Ray version: 2.3.0
2023-05-05 18:15:10,981 INFO monitor.py:653 -- Ray commit: cf7a56b4b0b648c324722df7c99c168e92ff0b45
2023-05-05 18:15:10,981 INFO monitor.py:654 -- Monitor started with command: ['/usr/local/lib/python3.8/dist-packages/ray/autoscaler/_private/monitor.py', '--logs-dir=/tmp/ray/session_2023-05-05_18-15-09_523603_127/logs', '--logging-rotate-bytes=536870912', '--logging-rotate-backup-count=5', '--gcs-address=172.16.30.130:1234', '--autoscaling-config=/root/ray_bootstrap_config.yaml', '--monitor-ip=172.16.30.130']
2023-05-05 18:15:10,984 INFO monitor.py:167 -- session_name: session_2023-05-05_18-15-09_523603_127
2023-05-05 18:15:10,985 INFO monitor.py:198 -- Starting autoscaler metrics server on port 44217
2023-05-05 18:15:10,986 INFO monitor.py:218 -- Monitor: Started
2023-05-05 18:15:11,007 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:11,007 INFO autoscaler.py:276 -- disable_node_updaters:False
2023-05-05 18:15:11,007 INFO autoscaler.py:284 -- disable_launch_config_check:False
2023-05-05 18:15:11,007 INFO autoscaler.py:296 -- foreground_node_launch:False
2023-05-05 18:15:11,007 INFO autoscaler.py:306 -- worker_liveness_check:True
2023-05-05 18:15:11,007 INFO autoscaler.py:314 -- worker_rpc_drain:True
2023-05-05 18:15:11,008 INFO autoscaler.py:364 -- StandardAutoscaler: {'cluster_name': 'default', 'auth': {'ssh_user': 'car'}, 'upscaling_speed': 1.0, 'idle_timeout_minutes': 5, 'docker': {'image': 'test', 'container_name': 'ray_container', 'pull_before_run': False, 'run_options': ['--ulimit nofile=65536:65536', '--shm-size=11gb', '--device=/dev/dri:/dev/dri', '--env="DISPLAY"', '--expose 22', '--expose 8265', '--cap-add SYS_PTRACE', '--env-file=$HOME/.env', "$(nvidia-smi >> null && echo --gpus all || echo '')", '--volume=$HOME/.ssh/:/root/.ssh', '--volume=$SHARED_VOLUME:$SHARED_VOLUME', '--volume=/tmp/ray_logs:/tmp', '--volume=$HOME/car_logs:/logs']}, 'initialization_commands': [], 'setup_commands': [], 'head_setup_commands': [], 'worker_setup_commands': [], 'head_start_ray_commands': ['ray stop', 'ulimit -c unlimited && ray start --head --dashboard-host=0.0.0.0 --port=1234 --autoscaling-config=~/ray_bootstrap_config.yaml', "ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest", '/prometheus-2.42.0.linux-amd64/prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml &', 'grafana-server -homepath /usr/share/grafana --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web &'], 'worker_start_ray_commands': ['ray stop', 'ray start --address=$RAY_HEAD_IP:1234 --resources=\'{"work\'${CAR}${NODEID}\'":\'${WORKRES}\',"det\'${CAR}${NODEID}\'":\'${DETRES}\',"feat\'${CAR}${NODEID}\'":\'${FEATRES}\'}\'', "ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest"], 'file_mounts': {}, 'cluster_synced_files': [], 'file_mounts_sync_continuously': False, 'rsync_exclude': ['**/.git', '**/.git/**'], 'rsync_filter': ['.gitignore'], 'provider': {'type': 'local', 'head_ip': '172.16.30.130', 'worker_ips': ['10.60.62.65', '172.16.30.136']}, 'max_workers': 2, 'available_node_types': {'local.cluster.node': {'node_config': {}, 'resources': {}, 'min_workers': 2, 'max_workers': 2}}, 'head_node_type': 'local.cluster.node', 'no_restart': False}
2023-05-05 18:15:11,009 INFO monitor.py:388 -- Autoscaler has not yet received load metrics. Waiting.
2023-05-05 18:15:16,030 INFO autoscaler.py:143 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-05-05 18:15:16,031 INFO autoscaler.py:419 -- 
======== Autoscaler status: 2023-05-05 18:15:16.031852 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 10.60.62.65: local.cluster.node, setting-up
 172.16.30.136: local.cluster.node, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:GTX
 0.00/9.160 GiB memory
 0.00/4.580 GiB object_store_memory

Demands:
 (no resource demands)
2023-05-05 18:15:16,033 INFO autoscaler.py:586 -- StandardAutoscaler: Terminating the node with id 10.60.62.65 and ip 10.60.62.65. (outdated)
2023-05-05 18:15:16,034 INFO autoscaler.py:586 -- StandardAutoscaler: Terminating the node with id 172.16.30.136 and ip 172.16.30.136. (outdated)
2023-05-05 18:15:16,034 INFO node_provider.py:172 -- NodeProvider: 10.60.62.65: Terminating node
2023-05-05 18:15:16,034 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,034 INFO node_provider.py:172 -- NodeProvider: 172.16.30.136: Terminating node
2023-05-05 18:15:16,037 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,038 INFO autoscaler.py:1366 -- StandardAutoscaler: Queue 2 new nodes for launch
2023-05-05 18:15:16,038 INFO autoscaler.py:462 -- The autoscaler took 0.008 seconds to complete the update iteration.
2023-05-05 18:15:16,038 INFO node_launcher.py:166 -- NodeLauncher0: Got 2 nodes to launch.
2023-05-05 18:15:16,039 INFO monitor.py:428 -- :event_summary:Resized to 8 CPUs, 1 GPUs.
2023-05-05 18:15:16,039 INFO monitor.py:428 -- :event_summary:Removing 2 nodes of type local.cluster.node (outdated).
2023-05-05 18:15:16,090 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,093 INFO node_provider.py:114 -- ClusterState: Writing cluster state: ['10.60.62.65', '172.16.30.136', '172.16.30.130']
2023-05-05 18:15:16,094 INFO node_launcher.py:166 -- NodeLauncher0: Launching 2 nodes, type local.cluster.node.
2023-05-05 18:15:21,061 INFO autoscaler.py:143 -- The autoscaler took 0.001 seconds to fetch the list of non-terminated nodes.
2023-05-05 18:15:21,062 INFO autoscaler.py:419 -- 
======== Autoscaler status: 2023-05-05 18:15:21.062314 ========
Node status
---------------------------------------------------------------
Healthy:
 1 local.cluster.node
Pending:
 10.60.62.65: local.cluster.node, uninitialized
 172.16.30.136: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/8.0 CPU
 0.0/1.0 GPU
 0.0/1.0 accelerator_type:GTX
 0.00/9.160 GiB memory
 0.00/4.580 GiB object_store_memory

Demands:
 (no resource demands)
2023-05-05 18:15:21,065 INFO autoscaler.py:1314 -- Creating new (spawn_updater) updater thread for node 10.60.62.65.
2023-05-05 18:15:21,066 INFO autoscaler.py:1314 -- Creating new (spawn_updater) updater thread for node 172.16.30.136.

==> /tmp/ray/session_latest/logs/monitor.out <==

@Buvaneash what’s your cluster config? Are you using the autoscaler? @Alex, do we cache the nodes when restarting a Ray cluster?

Another way to resolve this is to try KubeRay; Kubernetes can handle restarts quite well.

@yic we are trying to launch the cluster in an on-premises setup, with min and max workers equal to the number of worker IPs included; everything else related to the autoscaler is left at its default value.

The cluster.yaml config that we are using is below:

# A unique identifier for the head node and workers of this cluster.
cluster_name: default

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "test"
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: False
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536
        - --shm-size=11gb
        - --device=/dev/dri:/dev/dri
        - --env="DISPLAY"
        - --expose 22
        - --expose 8265
        - --cap-add SYS_PTRACE
        - --env-file=$HOME/.env
        - $(nvidia-smi >> null && echo --gpus all || echo '')
        - --volume=$HOME/.ssh/:/root/.ssh
        - --volume=$SHARED_VOLUME:$SHARED_VOLUME
        - --volume=/tmp/ray_logs:/tmp
        - --volume=$HOME/car_logs:/logs

provider:
    type: local
    head_ip: 172.16.30.130
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [10.60.62.65,172.16.30.136]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: car
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    # ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 2
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
        #    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: []
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --dashboard-host=0.0.0.0 --port=1234 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest
    - /prometheus-2.42.0.linux-amd64/prometheus --config.file=/tmp/ray/session_latest/metrics/prometheus/prometheus.yml &
    - grafana-server -homepath /usr/share/grafana --config /tmp/ray/session_latest/metrics/grafana/grafana.ini web &

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:1234 --resources='{"work'${CAR}${NODEID}'":'${WORKRES}',"det'${CAR}${NODEID}'":'${DETRES}',"feat'${CAR}${NODEID}'":'${FEATRES}'}'
    - ln -sfn $(readlink -f /tmp/ray/session_latest | cut -d'/' -f4-) /tmp/ray/session_latest
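
For clarity, the last worker_start_ray_commands entry builds the custom resources JSON from environment variables on each worker. With hypothetical values such as CAR=a, NODEID=1, WORKRES=4, DETRES=2 and FEATRES=1 (purely illustrative), it expands to something like:

ray start --address=$RAY_HEAD_IP:1234 --resources='{"worka1":4,"deta1":2,"feata1":1}'

so each worker registers its own work/det/feat custom resources with the head on port 1234.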

@yic Can you please suggest what could be going wrong here?

I have the same problem. I’m running Ray 2.8 and am trying to set up a local cluster. It seems like when the worker nodes are left uninitialized, there has been no network traffic between the head node and the worker nodes at all. As far as I can tell, no traffic is blocked between the machines and all ports are open. I can find no error message in the logs. The last log entry says "[…] INFO autoscaler.py:1331 -- Creating new (spawn_updater) updater thread for node […]". Just like for @Buvaneash, it sometimes works for me if I restart the cluster several times. Any idea what might be causing this? Is there a fix available?
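
For reference, these are the kinds of checks that can be run to narrow this down (the log paths are the Ray defaults already shown in this thread; the SSH user and worker IP are just the ones from the config above and will differ per setup):

ray monitor cluster.yaml                          # from the machine running ray up: tail the autoscaler output
ssh -o BatchMode=yes car@10.60.62.65 'echo ok'    # from the head node: confirm non-interactive SSH to a worker works
tail -f /tmp/ray/session_latest/logs/monitor.*    # on the head node: watch the monitor logs directly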

Same here. It doesn’t start the worker nodes, but after about the fifth time I restart the cluster it suddenly works.