This is the ray_conf.yaml I am using:
cluster_name: default
provider:
    type: local
    head_ip: 109.248.175.190
    worker_ips: [185.244.175.188]
auth:
    ssh_user: ubuntu
    ssh_private_key: ~/.ssh/ed25519
min_workers: 1
max_workers: 1
file_mounts: {
    "~/environment.yml": "/home/daniil/Lania/dragonshore/environment.yml"
}
setup_commands:
    - conda env create -f ~/environment.yml || conda env update -f ~/environment.yml
head_start_ray_commands:
    - conda activate ray && mlflow server --backend-store-uri mlflow_data --host 0.0.0.0 -p 8889 &> mlflow.log &
    - conda activate ray && ray stop
    - conda activate ray && ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - conda activate ray && ray stop
    - conda activate ray && ray start --address=$RAY_HEAD_IP:6379
After running ray up ray_conf.yaml --no-config-cache -v, I get the following error:
Cluster: default
File Mount: (~/environment.yml:/home/daniil/ray_test/environment.yml) refers to a file.
To ensure this mount updates properly, please use a directory.
Checking Local environment settings
2022-06-09 18:54:34,504 INFO node_provider.py:49 -- ClusterState: Loaded cluster state: ['185.244.175.188', '109.248.175.190']
Updating cluster configuration and running full setup.
Cluster Ray runtime will be restarted. Confirm [y/N]: y
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
<1/1> Setting up head node
Prepared bootstrap config
2022-06-09 18:54:35,534 INFO node_provider.py:110 -- ClusterState: Writing cluster state: ['185.244.175.188', '109.248.175.190']
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 109.248.175.190
Running `uptime`
Shared connection to 109.248.175.190 closed.
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Shared connection to 109.248.175.190 closed.
SSH still not available (SSH command failed.), retrying in 5 seconds.
Running `uptime`
Shared connection to 109.248.175.190 closed.
SSH still not available (SSH command failed.), retrying in 5 seconds.
And this keeps going forever.
python: 3.9.12
ray: 1.13.0 (same thing for 1.12.1)
Interestingly, if I use --use-normal-shells, it is able to connect over SSH just fine (but I then run into other issues because it no longer reads my .zshrc file, so running conda breaks).
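To narrow this down, the same `uptime` check could be run by hand in both modes. The sketch below only builds and prints the two commands. It assumes that the default mode wraps the remote command in an interactive bash login shell (something like `bash --login -i -c ...`), which I have not confirmed against Ray's source, and that `--use-normal-shells` sends the command bare. `build_ssh_check` is a hypothetical helper, not part of Ray; the user, key, and host are taken from the config above.

```python
import shlex


def build_ssh_check(user, key_path, host, remote_cmd="uptime", login_shell=True):
    """Return an argv list for a manual SSH connectivity check.

    login_shell=True wraps the remote command in an interactive bash login
    shell (ASSUMPTION about what the default mode does).
    login_shell=False sends the command bare (ASSUMED analogue of
    --use-normal-shells).
    """
    if login_shell:
        remote = "bash --login -i -c " + shlex.quote(remote_cmd)
    else:
        remote = remote_cmd
    return [
        "ssh",
        "-tt",                       # force a TTY, as an interactive shell expects one
        "-i", key_path,              # private key from the `auth` section
        "-o", "ConnectTimeout=10",   # fail fast instead of hanging
        f"{user}@{host}",
        remote,
    ]


# Print both variants so they can be pasted into a terminal and compared.
for mode in (True, False):
    argv = build_ssh_check("ubuntu", "~/.ssh/ed25519", "109.248.175.190", login_shell=mode)
    print("login shell:" if mode else "normal shell:", shlex.join(argv))
```

If the login-shell variant exits with an error while the bare one succeeds, that would point at the remote shell startup files rather than at SSH itself.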