Shared connection closed when using the Ray cluster launcher with the local (on-prem) provider

I have been able to create a cluster manually with my on-prem nodes. I ran into some errors when trying to implement autoscaling, so I tried the cluster launcher method instead, but I haven't been able to get it working. I've tried both the manual approach and the coordinator-server approach, and both give the same general error (Shared connection to <host IP> closed).
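
For reference, I launch the cluster with the standard command from the machine that has SSH access to all nodes (the filename is just whatever I saved the config below as):

ray up cluster.yaml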

The YAML file I am using:

# A unique identifier for the head node and workers of this cluster.
cluster_name: mbz

# Running Ray in Docker images is optional (this docker section can be commented out).
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
    image: "rayproject/ray-ml:latest-cpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray:latest-gpu   # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options:   # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

provider:
    type: local
    head_ip: mb2
    # You may need to supply a public ip for the head node if you need
    # to run `ray up` from outside of the Ray cluster's network
    # (e.g. the cluster is in an AWS VPC and you're starting ray from your laptop)
    # This is useful when debugging the local node provider with cloud VMs.
    # external_head_ip: YOUR_HEAD_PUBLIC_IP
    worker_ips: [node1, node2]
    # Optional when running automatic cluster management on prem. If you use a coordinator server,
    # then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
    # will assign individual nodes to clusters as needed.
    # coordinator_address: 172.16.69.158:7777

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: root
    # You can comment out `ssh_private_key` if the following machines don't need a private key for SSH access to the Ray
    # cluster:
    #   (1) The machine on which `ray up` is executed.
    #   (2) The head node of the Ray cluster.
    #
    # The machine that runs ray up executes SSH commands to set up the Ray head node. The Ray head node subsequently
    # executes SSH commands to set up the Ray worker nodes. When you run ray up, ssh credentials sitting on the ray up
    # machine are copied to the head node -- internally, the ssh key is added to the list of file mounts to rsync to head node.
    # ssh_private_key: ~/.ssh/id_rsa

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 2
# The default behavior for manually managed clusters is
# min_workers == max_workers == len(worker_ips),
# meaning that Ray is started on all available nodes of the cluster.
# For automatically managed clusters, max_workers is required and min_workers defaults to 0.

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

idle_timeout_minutes: 5

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH. E.g. you could save your conda env to an environment.yaml file, mount
# that directory to all nodes and call `conda -n my_env -f /path1/on/remote/machine/environment.yaml`. In this
# example paths on all nodes must be the same (so that conda can be called always with the same argument)
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
# rsync_exclude:
#     - "**/.git"
#     - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
# rsync_filter:
#     - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up each node.
setup_commands:
    - cd anaconda3/bin ; source activate LarusTF
    - conda activate LarusTF
    - apt install nfs-common;y
    - mkdir ~/HeadNode/
    - pkill -f tensorflow_model_server
    - mount 172.16.69.158:/media/root/prophet/DevOps ~/HeadNode/
    - cd ~/HeadNode/main-dev-1.0.3/bnlwe-da-p-80200-prophetball/
    - pip install -r requirements.txt
    - pip install -e .
    - DATA="/models/TensorFlowServe"
    - docker pull tensorflow/serving:latest
    - docker run -t --rm -p 8501:8501 -v "$(pwd)$DATA:$DATA" tensorflow/serving --model_config_file="/models/TensorFlowServe/models_TFServe.config.txt" --model_config_file_poll_wait_seconds=6000 --prefer_tflite_model=false & 
    
    # If we have e.g. conda dependencies stored in "/path1/on/local/machine/environment.yaml", we can prepare the
    # work environment on each worker by:
    #   1. making sure each worker has access to this file i.e. see the `file_mounts` section
    #   2. adding a command here that creates a new conda environment on each node or if the environment already exists,
    #     it updates it:
    #      conda env create -q -n my_venv -f /path1/on/local/machine/environment.yaml || conda env update -q -n my_venv -f /path1/on/local/machine/environment.yaml
    #
    # Ray developers:
    # you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - conda activate my_venv && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
    - ray stop
    - ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  # If we have e.g. conda dependencies, we could create on each node a conda environment (see `setup_commands` section).
  # In that case we'd have to activate that env on each node before running `ray`:
  # - conda activate my_venv && ray stop
  # - ray start --address=$RAY_HEAD_IP:6379
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379
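
For the coordinator-server attempt, the only part I changed was the provider section, roughly like this (same address as the commented-out coordinator_address above; as far as I understand, head_ip and worker_ips are dropped because the coordinator assigns nodes itself):

provider:
    type: local
    coordinator_address: 172.16.69.158:7777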

Error log:

Cluster: mbz

Checking Local environment settings
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 172.16.69.163
    Running `uptime`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=5s root@172.16.69.163 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '172.16.69.163' (ECDSA) to the list of known hosts.
 10:40:41 up 21:59,  1 user,  load average: 1.91, 0.78, 0.31
Shared connection to 172.16.69.163 closed.
    Success.
  Updating cluster configuration. [hash=da38ae13b1dae2f6f419d444a4b4adb3d2f183bd]
  New status: syncing-files
  [2/7] Processing file mounts
    Running `mkdir -p ~`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=120s root@172.16.69.163 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p ~)'`
Shared connection to 172.16.69.163 closed.
    Running `rsync --rsh ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz /tmp/ray-bootstrap-4fxbuadm root@172.16.69.163:~/ray_bootstrap_config.yaml`
sending incremental file list
ray-bootstrap-4fxbuadm

sent 890 bytes  received 35 bytes  1,850.00 bytes/sec
total size is 2,919  speedup is 3.16
    `rsync`ed /tmp/ray-bootstrap-4fxbuadm (local) to ~/ray_bootstrap_config.yaml (remote)
    ~/ray_bootstrap_config.yaml from /tmp/ray-bootstrap-4fxbuadm
  [3/7] No worker file mounts to sync
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
  [6/7] Running setup commands
    (0/12) cd anaconda3/bin ; source activate LarusTF
    Running `cd anaconda3/bin ; source activate LarusTF`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=120s root@172.16.69.163 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (cd anaconda3/bin ; source activate LarusTF)'`
Shared connection to 172.16.69.163 closed.
    (1/12) conda activate LarusTF
    Running `conda activate LarusTF`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=120s root@172.16.69.163 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (conda activate LarusTF)'`
Shared connection to 172.16.69.163 closed.
    (2/12) apt install nfs-common;y
    Running `apt install nfs-common;y`
      Full command is `ssh -tt -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_63a9f0ea7b/aed91b2b93/%C -o ControlPersist=10s -o ConnectTimeout=120s root@172.16.69.163 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (apt install nfs-common;y)'`
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nfs-common is already the newest version (1:1.3.4-2.5ubuntu3.3).
The following packages were automatically installed and are no longer required:
  linux-headers-5.8.0-49-generic linux-hwe-5.8-headers-5.8.0-49 linux-image-5.8.0-49-generic linux-modules-5.8.0-49-generic linux-modules-extra-5.8.0-49-generic
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
y: command not found
Shared connection to 172.16.69.163 closed.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!
  
  Failed to setup head node.

Note: as you may have realized, my cluster shares a volume through NFS, so I don't need to mount any files. I am also able to SSH from the host to each node and vice versa for all nodes.
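
To be concrete, a plain SSH check like the one below works without issues from the machine where I run ray up (172.16.69.163 is the head node IP fetched in the log above):

ssh root@172.16.69.163 uptime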

Hi all, I am facing the same error. What is the solution to this problem? It breaks during docker exec.
Please let me know, as this has become a blocking issue.
Cloud service: Azure
Regards,
Rani