How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
Hi, I am trying to set up a Ray cluster, but the process seems to hang at some point. The `ray up cluster.yaml` command appears to run perfectly fine, but then one of two things happens:
Either `ray status` returns `No cluster status. It may take a few seconds for the Ray internal services to start up.` and only the head node is launched,
or `ray status` returns:
```
Node status
---------------------------------------------------------------
Active:
 1 local.cluster.node
Pending:
 local.cluster.node, 7 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/16.0 CPU
 0.0/1.0 GPU
 0B/36.23GiB memory
 0B/18.12GiB object_store_memory
```
The dashboard seems to work fine in both cases, and I could not find anything useful in the logs (the files I have been checking are listed below).
Also, whatever happens, the `ray monitor` command does not work and hangs, and `ray down` hangs too until I run `ray stop --force`.
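For reference, this is roughly where I have been looking for errors on the head node; the paths assume Ray's default temp directory (`/tmp/ray`), so they may differ on other setups:

```bash
# SSH to the head node (user and IP as in the cluster.yaml below)
ssh m84396953@10.208.177.97

# Autoscaler/monitor logs for the current session; if worker launches were
# failing, I would expect a traceback or SSH error somewhere in these files.
tail -n 100 /tmp/ray/session_latest/logs/monitor.log
tail -n 100 /tmp/ray/session_latest/logs/monitor.err
tail -n 100 /tmp/ray/session_latest/logs/monitor.out
```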
My cluster YAML file is basically the example one, without Docker. All my machines run Ubuntu and use the conda base environment with Python 3.12.8 and Ray 2.42.
I have sometimes been able to launch a cluster with only 2 workers instead of 10, but that was not reproducible.
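For completeness, this is the quick check I ran by hand on each machine to confirm the environments match (nothing Ray-specific, just version checks):

```bash
# Run on the head node and on every worker: the output should be identical everywhere.
conda activate base
python --version   # Python 3.12.8 on all machines
ray --version      # ray, version 2.42.x on all machines
```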
Edit:
I have tested launching it in two steps:
1. `ray up cluster.yaml` → the status shows "launching" forever (the nodes never actually get launched).
2. Manually launching the nodes by running `ray start` directly on the machines (roughly the commands sketched below). This works: I can see the new nodes on the dashboard, but `ray status` still shows the "launching" state and does not show the new nodes, whether they were in the initial config file or not.
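These are roughly the manual commands that do work; they mirror the start commands in the config, with the head IP filled in by hand:

```bash
# On the head node (10.208.177.97):
conda activate base
ray stop --force
ray start --head --port=6379

# On each worker machine:
conda activate base
ray stop --force
ray start --address=10.208.177.97:6379
```

Here is the cluster.yaml from my latest attempt: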
```yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: default

provider:
    type: local
    head_ip: 10.208.177.97
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips:
    - 10.208.177.80
    - 10.208.177.43
    - 10.208.177.42
    - 10.208.177.136
    - 10.208.177.41
    - 10.208.177.93
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    # coordinator_address: "<host>:<port>"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: m84396953
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    # (1) The machine on which `ray up` is executed.
    # (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    ssh_private_key: ~/.ssh/id_ed25519

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []
# - conda activate .venv
# List of shell commands to run to set up each node.
setup_commands: []
# If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
# work environment on each worker by:
#   1. making sure each worker has access to this file i.e. see the `file_mounts` section
#   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
#      it updates it:
#      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
#
# Ray developers:
# you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - conda activate my_venv && ray stop
    # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - conda activate base && ray stop --force
    - conda activate base && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
    # In that case we'd have to activate that env on each node before running `ray`:
    # - bash -i -c 'conda activate base my_venv && ray stop'
    # - ray start --address=$RAY_HEAD_IP:6379
    - conda activate base && ray stop --force
    - conda activate base && ray start --address=$RAY_HEAD_IP:6379
```
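Since the comments in the `auth` section say it is the head node that SSHes into the workers to start Ray, my next step is to try reproducing that hop by hand from the head node, roughly like this (user, key path and IPs taken from the config above; `ray up` wraps its commands differently, so this is only an approximation):

```bash
# Run from the head node: check that conda and ray resolve over a
# non-interactive SSH session with the configured key, which is roughly
# how the autoscaler reaches a worker (10.208.177.80 is the first worker IP).
ssh -i ~/.ssh/id_ed25519 -o BatchMode=yes m84396953@10.208.177.80 \
    'conda activate base && ray --version'
```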
This is the latest configuration I have tried, but there have been quite a few iterations and none have worked. Any help would be greatly appreciated. I have managed to make the cluster work by just starting and stopping it manually, but it would be much easier to have `ray up` working.