Not able to ssh into head node during ray up

This is the ray_conf.yaml I am using:

cluster_name: default

provider:
    type: local
    head_ip: 109.248.175.190
    worker_ips: [185.244.175.188]

auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/ed25519
min_workers: 1
max_workers: 1

file_mounts: {
  "~/environment.yml": "/home/daniil/Lania/dragonshore/environment.yml"
}

setup_commands: 
  - conda env create -f ~/environment.yml || conda env update -f ~/environment.yml

head_start_ray_commands:
  - conda activate ray && mlflow server --backend-store-uri mlflow_data --host 0.0.0.0 -p 8889 &> mlflow.log &
  - conda activate ray && ray stop
  - conda activate ray && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
  - conda activate ray && ray stop
  - conda activate ray && ray start --address=$RAY_HEAD_IP:6379

After running ray up ray_conf.yaml --no-config-cache -v, I get the following output:

Cluster: default

File Mount: (~/environment.yml:/home/daniil/ray_test/environment.yml) refers to a file.
 To ensure this mount updates properly, please use a directory.
Checking Local environment settings
2022-06-09 18:54:34,504 INFO node_provider.py:49 -- ClusterState: Loaded cluster state: ['185.244.175.188', '109.248.175.190']
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

<1/1> Setting up head node
  Prepared bootstrap config
2022-06-09 18:54:35,534 INFO node_provider.py:110 -- ClusterState: Writing cluster state: ['185.244.175.188', '109.248.175.190']
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 109.248.175.190
    Running `uptime`
Shared connection to 109.248.175.190 closed.
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
Shared connection to 109.248.175.190 closed.
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
Shared connection to 109.248.175.190 closed.
    SSH still not available (SSH command failed.), retrying in 5 seconds.

And this keeps going forever.

python: 3.9.12
ray: 1.13.0 (same thing for 1.12.1)

Interestingly, if I use --use-normal-shells, it connects over SSH just fine. I then hit other issues, though: it does not go through my .zshrc file, so running conda breaks.

Would you mind re-running with more v's and sharing the output?
ray up ray_conf.yaml --no-config-cache -vvvvvvvvv

Hi @dorekhov1, could you provide more verbose output so we can diagnose this? Were you able to work around the issue?

Yeah, I fixed it. The problem was that I was expecting conda to install ray on the cluster, but it turns out that the cluster needs ray installed before it can run anything.
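
For anyone who lands here with the same symptom, one way to confirm this before running ray up is to ask each node whether a login shell can find the ray executable. The snippet below is only a sketch of that check, not something Ray provides: the function names build_ray_check_command and ray_is_installed are made up for illustration, and it assumes bash is the shell on the node and that key-based SSH already works.

```python
import os
import shlex
import subprocess


def build_ray_check_command(user, host, key_path):
    """Build an ssh command that asks a remote login shell whether `ray` is on PATH.

    Hypothetical helper for illustration; not part of Ray.
    """
    # `command -v ray` exits with status 0 only if the shell can resolve `ray`.
    remote = "bash --login -c " + shlex.quote("command -v ray")
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        f"{user}@{host}",
        remote,
    ]


def ray_is_installed(user, host, key_path, timeout=15):
    """Return True if the remote check exits with status 0 (assumes SSH itself works)."""
    result = subprocess.run(
        build_ray_check_command(user, host, key_path),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0
```

With the values from the config above, the call would be ray_is_installed("ubuntu", "109.248.175.190", "~/.ssh/ed25519"). If it returns False, installing ray on the node by hand (for example with pip) before ray up is the fix described above.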
