How do I troubleshoot nodes that remain uninitialized?

Hello, I’ve been bringing up lots of different Ray clusters using different versions, topologies, etc.

Periodically, for reasons unknown to me, one or more nodes get stuck pending in the “uninitialized” state.

It doesn’t seem to matter how long I wait; they never become healthy. Re-running ray up doesn’t help either.

Where can I look to find out what is stuck? I have looked at the ‘ray up’ output and the many logs under /tmp/ray/session_latest, but I can’t find anything relevant.

This particular instance is running 1.5.0rc1, but I have seen this sort of thing in 1.4.0, 1.4.1, and a master commit from a few weeks ago.

======== Autoscaler status: 2021-07-23 09:47:58.040204 ========
Node status
---------------------------------------------------------------
Healthy:
 2 local.cluster.node
Pending:
 10.0.1.2: local.cluster.node, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.0/152.0 CPU
 0.00/20.710 GiB memory
 0.00/9.559 GiB object_store_memory

Demands:
 (no resource demands)
2021-07-23 09:48:03,058	INFO autoscaler.py:356 --
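
(For reference, I believe this status block is the same thing you can print on demand with ray status on the head node; it also ends up in the monitor logs. A rough sketch, assuming Ray is on the PATH inside the container and "cluster.yaml" is a placeholder for your config file:)

# On the head node (inside the Ray container, if Docker is used):
ray status

# Or attach to the head node first via the cluster config:
ray attach cluster.yaml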

@Ameer_Haj_Ali can you take a look at this?

Referring to @dmitri and @mwtian.

@djakubiec Could you post the cluster config yaml used with ray up and any other details about the cluster?
If you’re able to consistently reproduce the problem, that would help.

For on-prem clusters, I’d actually recommend setting up the cluster manually by running ray start on each machine (or by writing a simple script to do this).
https://docs.ray.io/en/master/cluster/cloud.html#manual-ray-cluster-setup
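
Roughly, that would look something like the sketch below (the head IP is a placeholder; adjust the port and any extra flags to your setup):

# On the head machine:
ray stop
ray start --head --port=6379

# On each worker machine, pointing at the head machine's IP (placeholder shown):
ray stop
ray start --address=10.0.1.1:6379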

Hi @Dmitri, it looks like the problem was a stale Docker container that had somehow been left running from a previous cluster run. I’m not quite sure how it was interfering, but after I killed it manually and restarted the cluster, the node initialized correctly.
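
In case it’s useful to anyone else, the cleanup was something along these lines, run on the affected worker (container name taken from the container_name in my config below; the docker commands may need sudo depending on your setup):

# Look for leftover Ray containers from earlier runs:
docker ps -a --filter name=fvq_ray02

# Stop and remove the stale container:
docker stop fvq_ray02
docker rm fvq_ray02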

I had another separate instance where a node didn’t start because I had not yet built the corresponding Docker image on that machine.

But I guess my one general comment is that when a node fails to initialize, there doesn’t seem to be much information displayed about why… or maybe I just don’t know where to look for it? It always seems to be quite a hunt to figure out what went wrong.

Anyways, for what it’s worth, here is the YAML you asked for:

# A unique identifier for the head node and workers of this cluster.
cluster_name: fvq02

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    #image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    #image: "rayproject/ray-ml:latest-cpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    #image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    #container_name: "ray_container"
    image: "focusvq/ray02:latest"
    container_name: "fvq_ray02"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    #pull_before_run: True
    pull_before_run: False
    # Extra options to pass into "docker run"
    run_options: [ 
        '--memory=96g',
        '--shm-size=32g',
        #'--mount type=bind,source=/var/ray,target=/var/ray',
        '--mount type=bind,source=/var/run/flcs.sock,target=/var/run/flcs.sock',
        '--mount type=bind,source=/home/focusvq,target=/home/focusvq,readonly',
        '--mount type=bind,source=/ceph,target=/ceph,readonly',
        '--mount type=bind,source=/ceph/var/ray/share,target=/share',
        '--group-add 1010',     # fvq-metrics
        '--group-add 5000',     # fvq-dev
        '--group-add 5001',     # fvq-web
        #'--group-add 5002',     # fvq-git
        '--group-add 5003',     # fvq-external
        '--group-add 5005',     # fvq-ops
        '--group-add 5007',     # fvq-log
        #'--user=501:501',
        ]

# FocusAE NOTE: Avoid using hostnames below since we generally map those to loopback addresses.  That confuses certain parts of the Ray cluster assembly
provider:
    type: local
    #head_ip: YOUR_HEAD_NODE_HOSTNAME
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    #worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ]
    # NOTE: Avoid using hostnames, since the Focus cluster maps these to loopback addresses and confuses some stages of Ray
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    #    coordinator_address: "<host>:<port>"
    coordinator_address: "10.0.1.250:1300"

# How Ray will authenticate with newly launched nodes.
auth:
    #ssh_user: YOUR_USERNAME
    ssh_user: focusvq
    # Optional if an ssh private key is necessary to ssh to the cluster.
    ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 43

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 43
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands: [
]
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: [
]

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    #- ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/share/spill\"}}"}'

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379

Anyways, thanks for the help!

Oh, I think you may be able to glean some node startup info from /tmp/ray/session_latest/logs/monitor.out on the head node.
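
Something like this on the head node should show it (paths assume the default /tmp/ray session directory; the monitor.err counterpart, if present, can also be worth checking):

tail -n 200 /tmp/ray/session_latest/logs/monitor.out
tail -n 200 /tmp/ray/session_latest/logs/monitor.err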

It seems we need to do a better job of cleaning docker state when disconnecting a node with the LocalNodeProvider.


Tracking bad docker behavior here:


Hi folks, I think I might be running into a similar issue here. Could you share how you cleaned up the stale running Docker container? Did you do it manually from the worker node?