Ray repeatedly dies on startup

Why does Ray repeatedly die with messages like this? I’ve been using Ray for a few months and have gotten it to work, but the amount of time I spend setting up and tearing down VMs because of failures like this is ridiculous. Am I the only one having this problem?

9490fa75f515: Pull complete 
0780019b2838: Pull complete 
59293d71a1fc: Pull complete 
c4623a7ed5da: Pull complete 
Digest: sha256:6342f87c9fff0e8391ccf6b491c10d18d42e76c6c06210b4cdbb04ceec03e201
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 35.222.31.216 closed.
Shared connection to 35.222.31.216 closed.
Shared connection to 35.222.31.216 closed.
Shared connection to 35.222.31.216 closed.
Shared connection to 35.222.31.216 closed.
2021-09-05 22:29:13,049 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630898953666-5cb4b40cc3fe0-781f9272-11e7e78c to finish...
2021-09-05 22:29:18,447 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630898953666-5cb4b40cc3fe0-781f9272-11e7e78c finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!
  
  Failed to setup head node.

I’ve seen this issue reported by many users at this point. It seems to be specific to GCP. I would first run ray up -vvv to find the specific SSH command that is failing.

Once you have that SSH command, I recommend running it yourself with ssh -vvv to get detailed debug logs. It’d be great if you could post those here so that we can help diagnose the issue.
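
A minimal sketch of that workflow, assuming the verbose output has been saved to a file and that the last line of the form "Full command is `...`" belongs to the command that failed (which matches the logs in this thread, but is an assumption in general). The function names and the abbreviated sample log are hypothetical and only illustrate the idea:

```python
import re

# Matches the line that `ray up -vvv` prints before each remote command.
FULL_COMMAND = re.compile(r"Full command is `(.*)`\s*$")


def last_full_command(log_text):
    """Return the last command logged as 'Full command is `...`', or None."""
    found = None
    for line in log_text.splitlines():
        match = FULL_COMMAND.search(line)
        if match:
            found = match.group(1)
    return found


def with_ssh_debug(command):
    """Insert -vvv after the leading 'ssh' so a manual rerun prints debug logs."""
    prefix = "ssh "
    if not command.startswith(prefix):
        raise ValueError("expected an ssh command")
    return "ssh -vvv " + command[len(prefix):]


# Abbreviated sample: most -o options are omitted here for readability.
sample_log = """\
    Running `docker pull rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -o ConnectTimeout=120s ubuntu@35.222.31.216 bash --login -c -i 'true && (docker pull rayproject/ray-ml:latest-gpu)'`
Shared connection to 35.222.31.216 closed.
  New status: update-failed
"""

print(with_ssh_debug(last_full_command(sample_log)))
```

The printed line can be pasted into a shell as-is. In practice the input would be the real output of ray up -vvv rather than the sample string.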

cc @Dmitri

I ran it just now and it failed on the docker pull. When I re-ran it (forgetting to add the -vvv to the ssh), it succeeded. (On the latest GPU VMs I am working with, it does not always fail at this part.) It’s late, but tomorrow I will try to isolate it as requested. Note I have altered the project name in my copy/paste to protect myself.

Shared connection to 35.222.31.216 closed.
    Running `docker pull rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_ting-1-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@35.222.31.216 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest-gpu)'`
latest-gpu: Pulling from rayproject/ray-ml
feac53061382: Pull complete 
4e74d880cc84: Pull complete 
b88d0e36cef0: Pull complete 
a4a81bdd364a: Pull complete 
f34588647641: Pull complete 
cb625e653a5b: Pull complete 
8361034ddb18: Pull complete 
15971bbf4cfa: Pull complete 
71811213304a: Pull complete 
ff6d090c8a0a: Pull complete 
54e1344d56f5: Pull complete 
2324cc0fe838: Pull complete 
b5ce686a03b1: Pull complete 
4dea84a783ac: Pull complete 
fdee1378325f: Pull complete 
6c53bcd05a46: Pull complete 
2d6f22c6ab1a: Pull complete 
d124bc6f7552: Pull complete 
f718cbf90ebe: Pull complete 
9490fa75f515: Pull complete 
0780019b2838: Pull complete 
59293d71a1fc: Pull complete 
c4623a7ed5da: Extracting [==================================================>]  3.453GB/3.453GB
Shared connection to 35.222.31.216 closed.
2021-09-06 00:27:37,442 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630906057524-5cb4ce83885a1-277c502a-eb6a73a3 to finish...
2021-09-06 00:27:42,893 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630906057524-5cb4ce83885a1-277c502a-eb6a73a3 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!
  
  Failed to setup head node.
(env) chris_chiasson@penguin:~/te$ ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_ting-1-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@35.222.31.216 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest-gpu)'
Warning: Permanently added '35.222.31.216' (ECDSA) to the list of known hosts.
latest-gpu: Pulling from rayproject/ray-ml
feac53061382: Pull complete 
4e74d880cc84: Pull complete 
b88d0e36cef0: Pull complete 
a4a81bdd364a: Pull complete 
f34588647641: Pull complete 
cb625e653a5b: Pull complete 
8361034ddb18: Pull complete 
15971bbf4cfa: Pull complete 
71811213304a: Pull complete 
ff6d090c8a0a: Pull complete 
54e1344d56f5: Pull complete 
2324cc0fe838: Pull complete 
b5ce686a03b1: Pull complete 
4dea84a783ac: Pull complete 
fdee1378325f: Pull complete 
6c53bcd05a46: Pull complete 
2d6f22c6ab1a: Pull complete 
d124bc6f7552: Pull complete 
f718cbf90ebe: Pull complete 
9490fa75f515: Pull complete 
0780019b2838: Pull complete 
59293d71a1fc: Pull complete 
c4623a7ed5da: Pull complete 
Digest: sha256:6342f87c9fff0e8391ccf6b491c10d18d42e76c6c06210b4cdbb04ceec03e201
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 35.222.31.216 closed.
(env) chris_chiasson@penguin:~/te$
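
Because the failure is intermittent, one way to catch it is to rerun the same command several times and keep the output of the first failing attempt. The sketch below is an illustration only: run_until_failure is a hypothetical helper, and the demo uses a stand-in command that always exits with status 3 instead of the real ssh invocation. Note that repeating docker pull on the same VM may not behave like the first run, since layers that are already downloaded are not pulled again.

```python
import subprocess
import sys


def run_until_failure(argv, attempts=5):
    """Run argv up to `attempts` times.

    Returns (attempt_number, CompletedProcess) for the first run that exits
    with a non-zero status, or None if every attempt succeeded.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep ssh debug output with stdout
            text=True,
        )
        if result.returncode != 0:
            return attempt, result
    return None


# Stand-in for the real ssh command: prints a line and exits with status 3.
stand_in = [sys.executable, "-c", "import sys; print('pulling'); sys.exit(3)"]

outcome = run_until_failure(stand_in, attempts=3)
if outcome is None:
    print("all attempts succeeded")
else:
    attempt, result = outcome
    print(f"attempt {attempt} exited with status {result.returncode}")
    print(result.stdout, end="")
```

For a real run, argv would be the ssh command from the log split into a list, for example with shlex.split.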

As requested, here is a ray up -vvv run followed by ssh -vvv on the command that failed. Thankfully, ssh also appeared to fail this time, so hopefully this helps with reproduction. ssh seems to have “exited” while still printing output to the terminal, as the shell prompt a few lines above the end shows ((env) chris_chiasson@penguin:~/te$ debug3: send packet: type 80). I waited a while and then copied all the output. In the copy/paste, I replaced the project name and my local directory name to protect my privacy.

(env) chris_chiasson@penguin:~/te$ ray up -vvv gpu-cluster-full.yaml 
Cluster: default

Loaded cached provider configuration from /tmp/ray-config-9516e8899f656f94897811887e256109bf9f4f05
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
2021-09-06 18:29:30,309 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630970969562-5cb5c05479ceb-1feec835-790ea88d to finish...
2021-09-06 18:30:01,931 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630970969562-5cb5c05479ceb-1feec835-790ea88d finished.
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
2021-09-06 18:30:02,981 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630971003149-5cb5c07481e4b-d03f14be-23cf3045 to finish...
2021-09-06 18:30:08,419 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630971003149-5cb5c07481e4b-d03f14be-23cf3045 finished.
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 34.69.71.176
    Running `uptime`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 34.69.71.176 port 22: Connection timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '34.69.71.176' (ECDSA) to the list of known hosts.
ubuntu@34.69.71.176: Permission denied (publickey).
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '34.69.71.176' (ECDSA) to the list of known hosts.
 23:30:34 up 0 min,  1 user,  load average: 1.49, 0.37, 0.12
Shared connection to 34.69.71.176 closed.
    Success.
  Updating cluster configuration. [hash=a704061c3833450f2e7b7580855544aaccc71da2]
2021-09-06 18:30:34,532 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630971034721-5cb5c0929db4e-b76aff0a-79749d99 to finish...
2021-09-06 18:30:39,931 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630971034721-5cb5c0929db4e-b76aff0a-79749d99 finished.
  New status: syncing-files
  [2/7] Processing file mounts
    Running `mkdir -p /tmp/ray_tmp_mount/default/~/json && chown -R ubuntu /tmp/ray_tmp_mount/default/~/json`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~/json && chown -R ubuntu /tmp/ray_tmp_mount/default/~/json)'`
Shared connection to 34.69.71.176 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore ./json/ting-1-f3bd25cf2f05.json ubuntu@34.69.71.176:/tmp/ray_tmp_mount/default/~/json/ting-1-f3bd25cf2f05.json`
sending incremental file list
ting-1-f3bd25cf2f05.json

sent 1,764 bytes  received 35 bytes  3,598.00 bytes/sec
total size is 2,309  speedup is 1.28
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.69.71.176 closed.
    `rsync`ed ./json/ting-1-f3bd25cf2f05.json (local) to ~/json/ting-1-f3bd25cf2f05.json (remote)
    ~/json/ting-1-f3bd25cf2f05.json from ./json/ting-1-f3bd25cf2f05.json
    Running `mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~)'`
Shared connection to 34.69.71.176 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /tmp/ray-bootstrap-vq2zw790 ubuntu@34.69.71.176:/tmp/ray_tmp_mount/default/~/ray_bootstrap_config.yaml`
sending incremental file list
ray-bootstrap-vq2zw790

sent 1,294 bytes  received 35 bytes  2,658.00 bytes/sec
total size is 3,355  speedup is 2.52
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.69.71.176 closed.
    `rsync`ed /tmp/ray-bootstrap-vq2zw790 (local) to ~/ray_bootstrap_config.yaml (remote)
    ~/ray_bootstrap_config.yaml from /tmp/ray-bootstrap-vq2zw790
    Running `mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~)'`
Shared connection to 34.69.71.176 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem ubuntu@34.69.71.176:/tmp/ray_tmp_mount/default/~/ray_bootstrap_key.pem`
sending incremental file list
ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem

sent 1,461 bytes  received 35 bytes  997.33 bytes/sec
total size is 1,679  speedup is 1.12
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.69.71.176 closed.
    `rsync`ed /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem (local) to ~/ray_bootstrap_key.pem (remote)
    ~/ray_bootstrap_key.pem from /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem
  [3/7] No worker file mounts to sync
2021-09-06 18:30:47,970 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630971047999-5cb5c09f47981-6afc922f-ae8b854d to finish...
2021-09-06 18:30:53,465 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630971047999-5cb5c09f47981-6afc922f-ae8b854d finished.
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
    Running `command -v docker || echo 'NoExist'`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 34.69.71.176 closed.
    Running `docker pull rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest-gpu)'`
latest-gpu: Pulling from rayproject/ray-ml
feac53061382: Pull complete 
4e74d880cc84: Pull complete 
b88d0e36cef0: Pull complete 
a4a81bdd364a: Pull complete 
f34588647641: Pull complete 
cb625e653a5b: Pull complete 
8361034ddb18: Pull complete 
15971bbf4cfa: Pull complete 
71811213304a: Pull complete 
ff6d090c8a0a: Pull complete 
54e1344d56f5: Pull complete 
2324cc0fe838: Pull complete 
b5ce686a03b1: Pull complete 
4dea84a783ac: Pull complete 
fdee1378325f: Pull complete 
6c53bcd05a46: Pull complete 
2d6f22c6ab1a: Pull complete 
d124bc6f7552: Pull complete 
f718cbf90ebe: Pull complete 
9490fa75f515: Pull complete 
0780019b2838: Pull complete 
59293d71a1fc: Pull complete 
c4623a7ed5da: Pull complete 
Digest: sha256:6342f87c9fff0e8391ccf6b491c10d18d42e76c6c06210b4cdbb04ceec03e201
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 34.69.71.176 closed.
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.69.71.176 closed.
    Running `docker inspect -f '{{json .Config.Env}}' rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{json .Config.Env}}'"'"' rayproject/ray-ml:latest-gpu)'`
Shared connection to 34.69.71.176 closed.
    Running `cat /proc/meminfo || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (cat /proc/meminfo || true)'`
Shared connection to 34.69.71.176 closed.
    Running `docker info -f '{{.Runtimes}}' `
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker info -f '"'"'{{.Runtimes}}'"'"' )'`
Shared connection to 34.69.71.176 closed.
2021-09-06 18:35:15,293 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1630971315455-5cb5c19e58578-1f599cdd-5a28bd34 to finish...
2021-09-06 18:35:20,741 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1630971315455-5cb5c19e58578-1f599cdd-5a28bd34 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!
  
  Failed to setup head node.
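
To see how long the node spent in each phase before failing, the "New status" lines can be paired with the timestamp logged just above them. This is a rough sketch under that assumption: the helper names are hypothetical, milliseconds are dropped, and the sample lines are abbreviated from the log above (operation IDs replaced with "..."). For this run it gives about 267 seconds between setting-up and update-failed.

```python
import re
from datetime import datetime

TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")
STATUS = re.compile(r"^\s*New status: (\S+)")


def status_timeline(log_text):
    """Pair each 'New status' line with the most recent timestamp above it."""
    timeline = []
    latest = None
    for line in log_text.splitlines():
        stamp = TIMESTAMP.match(line)
        if stamp:
            latest = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S")
            continue
        status = STATUS.match(line)
        if status and latest is not None:
            timeline.append((status.group(1), latest))
    return timeline


def phase_durations(timeline):
    """Seconds between each status and the next one that was reported."""
    return [
        (name, (next_time - time).total_seconds())
        for (name, time), (_, next_time) in zip(timeline, timeline[1:])
    ]


# Abbreviated from the log above; "..." stands for the operation IDs.
sample_log = """\
2021-09-06 18:30:08,419 INFO node.py:307 -- wait_for_compute_zone_operation: Operation ... finished.
  New status: waiting-for-ssh
2021-09-06 18:30:39,931 INFO node.py:307 -- wait_for_compute_zone_operation: Operation ... finished.
  New status: syncing-files
2021-09-06 18:30:53,465 INFO node.py:307 -- wait_for_compute_zone_operation: Operation ... finished.
  New status: setting-up
2021-09-06 18:35:20,741 INFO node.py:307 -- wait_for_compute_zone_operation: Operation ... finished.
  New status: update-failed
"""

for name, seconds in phase_durations(status_timeline(sample_log)):
    print(f"{name}: {seconds:.0f} s")
```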

Here is the redo with ssh -vvv.

(env) chris_chiasson@penguin:~/te$ ssh -vvv -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker info -f '"'"'{{.Runtimes}}'"'"' )'
OpenSSH_7.9p1 Debian-10+deb10u2, OpenSSL 1.1.1d  10 Sep 2019
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug2: resolve_canonicalize: hostname 34.69.71.176 is address
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ray_ssh_8ec3601252/c21f969b5f/23757eecdedcbfcc38f7fad06d48dddaf9997a16" does not exist
debug2: ssh_connect_direct
debug1: Connecting to 34.69.71.176 [34.69.71.176] port 22.
debug2: fd 3 setting O_NONBLOCK
debug1: fd 3 clearing O_NONBLOCK
debug1: Connection established.
debug3: timeout: 119914 ms remain after connect
debug1: identity file /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem type -1
debug1: identity file /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.9p1 Debian-10+deb10u2
debug1: match: OpenSSH_7.9p1 Debian-10+deb10u2 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 34.69.71.176:22 as 'ubuntu'
debug3: hostkeys_foreach: reading file "/dev/null"
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos: 
debug2: languages stoc: 
debug2: first_kex_follows 0 
debug2: reserved 0 
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos: 
debug2: languages stoc: 
debug2: first_kex_follows 0 
debug2: reserved 0 
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug3: receive packet: type 31
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:5e50WClpGyS28iyXDTFEFb8yE6s4Gf7+eRdSEwHtCQ8
debug3: hostkeys_foreach: reading file "/dev/null"
Warning: Permanently added '34.69.71.176' (ECDSA) to the list of known hosts.
debug3: send packet: type 21
debug2: set_newkeys: mode 1
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: receive packet: type 21
debug1: SSH2_MSG_NEWKEYS received
debug2: set_newkeys: mode 0
debug1: rekey after 134217728 blocks
debug1: Will attempt key: /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem  explicit
debug2: pubkey_prepare: done
debug3: send packet: type 5
debug3: receive packet: type 7
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug3: receive packet: type 6
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug3: send packet: type 50
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey
debug3: start over, passed a different list publickey
debug3: preferred gssapi-keyex,gssapi-with-mic,publickey,keyboard-interactive,password
debug3: authmethod_lookup publickey
debug3: remaining preferred: keyboard-interactive,password
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Trying private key: /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem
debug3: sign_and_send_pubkey: RSA SHA256:WkudYX4AaTrHi2TTs1iwbISXMBXTMla66aMhN4UDUbo
debug3: sign_and_send_pubkey: signing using rsa-sha2-512
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (publickey).
Authenticated to 34.69.71.176 ([34.69.71.176]:22).
debug1: setting up multiplex master socket
debug3: muxserver_listen: temporary control path /tmp/ray_ssh_8ec3601252/c21f969b5f/23757eecdedcbfcc38f7fad06d48dddaf9997a16.aZlJtE3dwH0Mlu3h
debug2: fd 4 setting O_NONBLOCK
debug3: fd 4 is O_NONBLOCK
debug3: fd 4 is O_NONBLOCK
debug1: channel 0: new [/tmp/ray_ssh_8ec3601252/c21f969b5f/23757eecdedcbfcc38f7fad06d48dddaf9997a16]
debug3: muxserver_listen: mux listener channel 0 fd 4
debug2: fd 3 setting TCP_NODELAY
debug3: ssh_packet_set_tos: set IP_TOS 0x08
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug2: set_control_persist_exit_time: schedule exit in 10 seconds
debug2: control_persist_detach: background process is 4878
debug2: fd 4 setting O_NONBLOCK
debug1: multiplexing control connection
debug2: fd 5 setting O_NONBLOCK
debug3: fd 5 is O_NONBLOCK
debug1: channel 1: new [mux-control]
debug3: channel_post_mux_listener: new mux channel 1 fd 5
debug3: mux_master_read_cb: channel 1: hello sent
debug2: set_control_persist_exit_time: cancel scheduled exit
debug3: mux_master_read_cb: channel 1 packet type 0x00000001 len 4
debug2: mux_master_process_hello: channel 1 slave version 4
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_master_read_cb: channel 1 packet type 0x10000004 len 4
debug2: mux_master_process_alive_check: channel 1: alive check
debug3: mux_client_request_alive: done pid = 4880
debug3: mux_client_request_session: session request sent
debug3: mux_master_read_cb: channel 1 packet type 0x10000002 len 200
debug2: mux_master_process_new_session: channel 1: request tty 1, X 0, agent 0, subsys 0, term "xterm-256color", cmd "bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker info -f '{{.Runtimes}}' )", env 1
debug3: mux_master_process_new_session: got fds stdin 6, stdout 7, stderr 8
debug1: channel 2: new [client-session]
debug2: mux_master_process_new_session: channel_new: 2 linked to control channel 1
debug2: channel 2: send open
debug3: send packet: type 90
debug3: receive packet: type 80
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug3: receive packet: type 4
debug1: Remote: /home/ubuntu/.ssh/authorized_keys:6: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug3: receive packet: type 91
debug2: channel_input_open_confirmation: channel 2: callback start
debug2: client_session2_setup: id 2
debug2: channel 2: request pty-req confirm 1
debug3: send packet: type 98
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 2: request env confirm 0
debug3: send packet: type 98
debug1: Sending command: bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker info -f '{{.Runtimes}}' )
debug2: channel 2: request exec confirm 1
debug3: send packet: type 98
debug3: mux_session_confirm: sending success reply
debug2: channel_input_open_confirmation: channel 2: callback done
debug2: channel 2: open confirm rwindow 0 rmax 32768
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 2
debug2: PTY allocation request accepted on channel 2
debug2: channel 2: rcvd adjust 2097152
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 2
debug2: exec request accepted on channel 2
map[nvidia:{nvidia-container-runtime []} runc:{runc []}]
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
debug3: mux_exit_message: channel 2: exit message, exitval 0
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0
debug2: channel 2: rcvd eow
debug2: channel 2: chan_shutdown_read (i0 o0 sock -1 wfd 6 efd 8 [write])
debug2: channel 2: input open -> closed
debug3: receive packet: type 96
debug2: channel 2: rcvd eof
debug2: channel 2: output open -> drain
debug2: channel 2: obuf empty
debug2: channel 2: chan_shutdown_write (i3 o1 sock -1 wfd 7 efd 8 [write])
debug2: channel 2: output drain -> closed
debug3: receive packet: type 97
debug2: channel 2: rcvd close
debug3: channel 2: will not send data after close
debug2: channel 2: send close
debug3: send packet: type 97
debug2: channel 2: is dead
debug2: channel 2: gc: notify user
debug3: mux_master_session_cleanup_cb: entering for channel 2
debug2: channel 1: rcvd close
debug2: channel 1: output open -> drain
debug2: channel 1: chan_shutdown_read (i0 o1 sock 5 wfd 5 efd -1 [closed])
debug2: channel 1: input open -> closed
debug2: channel 2: gc: user detached
debug2: channel 2: is dead
debug2: channel 2: garbage collecting
debug1: channel 2: free: client-session, nchannels 3
debug3: channel 2: status: The following connections are open:
  #1 mux-control (t16 nr0 i3/0 o1/16 e[closed]/0 fd 5/5/-1 sock 5 cc -1)
  #2 client-session (t4 r0 i3/0 o3/0 e[write]/0 fd -1/-1/8 sock -1 cc -1)

debug2: channel 1: obuf empty
debug2: channel 1: chan_shutdown_write (i3 o1 sock 5 wfd 5 efd -1 [closed])
debug2: channel 1: output drain -> closed
debug2: channel 1: is dead (local)
debug2: channel 1: gc: notify user
debug3: mux_master_control_cleanup_cb: entering for channel 1
debug2: channel 1: gc: user detached
debug2: channel 1: is dead (local)
debug2: channel 1: garbage collecting
debug1: channel 1: free: mux-control, nchannels 2
debug3: channel 1: status: The following connections are open:
  #1 mux-control (t16 nr0 i3/0 o3/0 e[closed]/0 fd 5/5/-1 sock 5 cc -1)

debug2: set_control_persist_exit_time: schedule exit in 10 seconds
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 34.69.71.176 closed.
(env) chris_chiasson@penguin:~/te$ debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug1: ControlPersist timeout expired
debug3: send packet: type 1
debug1: channel 0: free: /tmp/ray_ssh_8ec3601252/c21f969b5f/23757eecdedcbfcc38f7fad06d48dddaf9997a16, nchannels 1
debug3: channel 0: status: The following connections are open:

debug3: fd 0 is not O_NONBLOCK
debug3: fd 1 is not O_NONBLOCK
Transferred: sent 2892, received 2488 bytes, in 14.4 seconds
Bytes per second: sent 201.1, received 173.0
debug1: Exit status -1
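
For what it's worth, a dump like the one above can be checked mechanically for what the remote command returned. The sketch below is only an illustration: `remote_exit_values` is a made-up name, and it assumes the `mux_exit_message ... exitval N` wording seen in this log. Applied to the dump above it would report 0 for the `docker info` step, which suggests that this particular command succeeded on the remote side.

```python
import re
from typing import List

# Matches the multiplexing master's report of a session's exit value,
# e.g. "debug3: mux_exit_message: channel 2: exit message, exitval 0".
EXITVAL = re.compile(r"mux_exit_message: channel \d+: exit message, exitval (-?\d+)")


def remote_exit_values(ssh_debug_output: str) -> List[int]:
    """Return every remote exit value reported in `ssh -vvv` output."""
    return [int(match.group(1)) for match in EXITVAL.finditer(ssh_debug_output)]


sample = (
    "debug1: client_input_channel_req: channel 2 rtype exit-status reply 0\n"
    "debug3: mux_exit_message: channel 2: exit message, exitval 0\n"
    "debug1: Exit status -1\n"
)
print(remote_exit_values(sample))
```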

Out of curiosity, I just tried switching to my own ~/.ssh public and private keys, as Ray does on Azure (that is, adding two lines to the ubuntu user’s SSH part of the cluster YAML and adding the public key to file_mounts). It didn’t work on GCP, though it does on Azure.

Here is the error message for that case. It fails early on.

    Running `uptime`
      Full command is `ssh -tt -i ~/.ssh/id_rsa -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=10s -o ConnectTimeout=5s ubuntu@34.69.71.176 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '34.69.71.176' (ECDSA) to the list of known hosts.
ubuntu@34.69.71.176: Permission denied (publickey).
    SSH still not available (SSH command failed.), retrying in 5 seconds.
^C
Aborted!
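
A possible explanation, offered as an assumption rather than a confirmed diagnosis: file_mounts are synced over SSH, so a public key delivered that way cannot authorize the very first `uptime` connection. On GCP, when OS Login is not in use, a key is accepted once it appears in the project or instance `ssh-keys` metadata, one `USERNAME:KEY` entry per line. The helper below (a hypothetical name, not part of Ray) only formats such an entry from a public key line, so it can be compared with what the metadata actually contains.

```python
def gcp_ssh_keys_entry(username: str, public_key_line: str) -> str:
    """Format one `ssh-keys` metadata entry as USERNAME:KEY_TYPE KEY_BODY [COMMENT]."""
    parts = public_key_line.split()
    if len(parts) < 2:
        raise ValueError("expected '<key-type> <base64-key> [comment]'")
    # Normalize whitespace but keep the key type, key body and optional comment.
    return f"{username}:{' '.join(parts)}"


# Placeholder key material, not a real key.
entry = gcp_ssh_keys_entry("ubuntu", "ssh-rsa AAAAB3NzaPLACEHOLDER chris@penguin\n")
print(entry)
```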

Is it possible that the ControlPersist parameter is causing the timeout?

Can you try:

-ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o ... ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash...
+ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o ... ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.69.71.176 bash...

Note that the only change is the ControlPersist parameter (10s to 500s). Btw, thanks a bunch for posting these logs! I feel like if we can get to the bottom of this, we’d make the GCP experience much better for the Ray community.
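
To compare the two settings without retyping the whole command, the option can be swapped programmatically. This is a minimal sketch, not Ray's own code: `set_ssh_option` is a hypothetical helper, and the sample command is a shortened stand-in for the full one logged by `ray up -vvv`.

```python
import shlex


def set_ssh_option(command: str, name: str, value: str) -> str:
    """Return `command` with each `-o name=...` pair rewritten to `-o name=value`."""
    tokens = shlex.split(command)
    prefix = name + "="
    rewritten = []
    for index, token in enumerate(tokens):
        follows_dash_o = index > 0 and tokens[index - 1] == "-o"
        if follows_dash_o and token.startswith(prefix):
            rewritten.append(prefix + value)
        else:
            rewritten.append(token)
    # shlex.join (Python 3.8+) re-quotes tokens for the shell; the quoting style
    # may differ from the input, but the arguments are the same.
    return shlex.join(rewritten)


# Shortened stand-in for the command logged by `ray up -vvv`.
original = "ssh -tt -o ControlPersist=10s -o ConnectTimeout=120s ubuntu@34.69.71.176 uptime"
changed = set_ssh_option(original, "ControlPersist", "500s")
print(changed)
```

Only options passed as separate `-o NAME=VALUE` arguments are handled; anything else in the command is passed through unchanged.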

I ended up doing this after I drove to Houma, because I was on the phone. Due to problems with reproducibility (which I still have, but…), I modified the Ray source file that sets the ControlPersist parameter so that it is always 500s. It still hit update-failed. After that, I re-ran the failed ssh command with -vvv as requested. Now I can sleep in my car at my house while I wait for the insurance field adjuster 🙂

(env) chris_chiasson@penguin:~/te$ ray up -vvv --no-config-cache gpu-cluster-full.yaml 
Cluster: default

Checking GCP environment settings
2021-09-08 04:11:01,762 INFO config.py:451 -- _configure_key_pair: Private key not specified in config, using/home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
2021-09-08 04:11:07,752 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631092267504-5cb7843335347-08e248d4-19e374e4 to finish...
2021-09-08 04:11:40,314 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631092267504-5cb7843335347-08e248d4-19e374e4 finished.
  Launched a new head node
  Fetching the new head node
  
<1/1> Setting up head node
  Prepared bootstrap config
2021-09-08 04:11:41,680 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631092302429-5cb7845483c49-b7f269fa-9ced7cd9 to finish...
2021-09-08 04:11:47,340 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631092302429-5cb7845483c49-b7f269fa-9ced7cd9 finished.
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 34.133.193.66
    Running `uptime`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=5s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
ssh: connect to host 34.133.193.66 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
    Running `uptime`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=5s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (uptime)'`
Warning: Permanently added '34.133.193.66' (ECDSA) to the list of known hosts.
 09:12:10 up 0 min,  1 user,  load average: 1.79, 0.44, 0.15
Shared connection to 34.133.193.66 closed.
    Success.
  Updating cluster configuration. [hash=a31be92a0bb45f557d45bc4c2e2b574999835faf]
2021-09-08 04:12:11,151 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631092331650-5cb7847061d97-05a68852-214b39d0 to finish...
2021-09-08 04:12:16,793 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631092331650-5cb7847061d97-05a68852-214b39d0 finished.
  New status: syncing-files
  [2/7] Processing file mounts
    Running `mkdir -p /tmp/ray_tmp_mount/default/~/.ssh && chown -R ubuntu /tmp/ray_tmp_mount/default/~/.ssh`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~/.ssh && chown -R ubuntu /tmp/ray_tmp_mount/default/~/.ssh)'`
Shared connection to 34.133.193.66 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /home/chris_chiasson/.ssh/id_rsa.pub ubuntu@34.133.193.66:/tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub`
sending incremental file list
id_rsa.pub

sent 720 bytes  received 35 bytes  503.33 bytes/sec
total size is 750  speedup is 0.99
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.133.193.66 closed.
    `rsync`ed /home/chris_chiasson/.ssh/id_rsa.pub (local) to ~/.ssh/id_rsa.pub (remote)
    ~/.ssh/id_rsa.pub from /home/chris_chiasson/.ssh/id_rsa.pub
    Running `mkdir -p /tmp/ray_tmp_mount/default/~/json && chown -R ubuntu /tmp/ray_tmp_mount/default/~/json`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~/json && chown -R ubuntu /tmp/ray_tmp_mount/default/~/json)'`
Shared connection to 34.133.193.66 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore ./json/ting-1-f3bd25cf2f05.json ubuntu@34.133.193.66:/tmp/ray_tmp_mount/default/~/json/ting-1-f3bd25cf2f05.json`
sending incremental file list
ting-1-f3bd25cf2f05.json

sent 1,764 bytes  received 35 bytes  3,598.00 bytes/sec
total size is 2,309  speedup is 1.28
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.133.193.66 closed.
    `rsync`ed ./json/ting-1-f3bd25cf2f05.json (local) to ~/json/ting-1-f3bd25cf2f05.json (remote)
    ~/json/ting-1-f3bd25cf2f05.json from ./json/ting-1-f3bd25cf2f05.json
    Running `mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~)'`
Shared connection to 34.133.193.66 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /tmp/ray-bootstrap-jz54j3do ubuntu@34.133.193.66:/tmp/ray_tmp_mount/default/~/ray_bootstrap_config.yaml`
sending incremental file list
ray-bootstrap-jz54j3do

sent 1,312 bytes  received 35 bytes  2,694.00 bytes/sec
total size is 3,397  speedup is 2.52
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.133.193.66 closed.
    `rsync`ed /tmp/ray-bootstrap-jz54j3do (local) to ~/ray_bootstrap_config.yaml (remote)
    ~/ray_bootstrap_config.yaml from /tmp/ray-bootstrap-jz54j3do
    Running `mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (mkdir -p /tmp/ray_tmp_mount/default/~ && chown -R ubuntu /tmp/ray_tmp_mount/default/~)'`
Shared connection to 34.133.193.66 closed.
    Running `rsync --rsh ssh -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s -avz --exclude **/.git --exclude **/.git/** --filter dir-merge,- .gitignore /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem ubuntu@34.133.193.66:/tmp/ray_tmp_mount/default/~/ray_bootstrap_key.pem`
sending incremental file list
ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem

sent 1,461 bytes  received 35 bytes  997.33 bytes/sec
total size is 1,679  speedup is 1.12
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.133.193.66 closed.
    `rsync`ed /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem (local) to ~/ray_bootstrap_key.pem (remote)
    ~/ray_bootstrap_key.pem from /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem
  [3/7] No worker file mounts to sync
2021-09-08 04:12:30,931 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631092351515-5cb7848353d3c-d0b1d99f-2610168d to finish...
2021-09-08 04:12:36,594 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631092351515-5cb7848353d3c-d0b1d99f-2610168d finished.
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initalizing command runner
    Running `command -v docker || echo 'NoExist'`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (command -v docker || echo '"'"'NoExist'"'"')'`
Shared connection to 34.133.193.66 closed.
    Running `docker pull rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker pull rayproject/ray-ml:latest-gpu)'`
latest-gpu: Pulling from rayproject/ray-ml
feac53061382: Pull complete 
4e74d880cc84: Pull complete 
b88d0e36cef0: Pull complete 
a4a81bdd364a: Pull complete 
f34588647641: Pull complete 
cb625e653a5b: Pull complete 
8361034ddb18: Pull complete 
15971bbf4cfa: Pull complete 
71811213304a: Pull complete 
ff6d090c8a0a: Pull complete 
54e1344d56f5: Pull complete 
2324cc0fe838: Pull complete 
b5ce686a03b1: Pull complete 
4dea84a783ac: Pull complete 
fdee1378325f: Pull complete 
6c53bcd05a46: Pull complete 
2d6f22c6ab1a: Pull complete 
d124bc6f7552: Pull complete 
f718cbf90ebe: Pull complete 
9490fa75f515: Pull complete 
0780019b2838: Pull complete 
59293d71a1fc: Pull complete 
c4623a7ed5da: Pull complete 
Digest: sha256:6342f87c9fff0e8391ccf6b491c10d18d42e76c6c06210b4cdbb04ceec03e201
Status: Downloaded newer image for rayproject/ray-ml:latest-gpu
docker.io/rayproject/ray-ml:latest-gpu
Shared connection to 34.133.193.66 closed.
    Running `docker inspect -f '{{.State.Running}}' ray_nvidia_docker || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{.State.Running}}'"'"' ray_nvidia_docker || true)'`
Shared connection to 34.133.193.66 closed.
    Running `docker inspect -f '{{json .Config.Env}}' rayproject/ray-ml:latest-gpu`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker inspect -f '"'"'{{json .Config.Env}}'"'"' rayproject/ray-ml:latest-gpu)'`
Shared connection to 34.133.193.66 closed.
    Running `cat /proc/meminfo || true`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (cat /proc/meminfo || true)'`
Shared connection to 34.133.193.66 closed.
    Running `docker info -f '{{.Runtimes}}' `
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker info -f '"'"'{{.Runtimes}}'"'"' )'`
Shared connection to 34.133.193.66 closed.
    Running `nvidia-smi`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (nvidia-smi)'`
Wed Sep  8 09:17:01 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    60W / 149W |      3MiB / 11441MiB |     65%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Shared connection to 34.133.193.66 closed.
    Running `docker run --rm --name ray_nvidia_docker -d -it -v /tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub:/home/ray/.ssh/id_rsa.pub -v /tmp/ray_tmp_mount/default/~/json/ting-1-f3bd25cf2f05.json:/home/ray/json/ting-1-f3bd25cf2f05.json -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='17729594572.800003b' --runtime=nvidia --net=host rayproject/ray-ml:latest-gpu bash`
      Full command is `ssh -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker run --rm --name ray_nvidia_docker -d -it -v /tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub:/home/ray/.ssh/id_rsa.pub -v /tmp/ray_tmp_mount/default/~/json/ting-1-f3bd25cf2f05.json:/home/ray/json/ting-1-f3bd25cf2f05.json -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='"'"'17729594572.800003b'"'"' --runtime=nvidia --net=host rayproject/ray-ml:latest-gpu bash)'`
Shared connection to 34.133.193.66 closed.
2021-09-08 04:17:01,704 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1631092622441-5cb78585b3ba3-815b7b64-55bf23a2 to finish...
2021-09-08 04:17:07,615 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1631092622441-5cb78585b3ba3-815b7b64-55bf23a2 finished.
  New status: update-failed
  !!!
  {'message': 'SSH command failed.'}
  SSH command failed.
  !!!
  
  Failed to setup head node.
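
To rerun the last logged step by hand, the final `Full command is` line in output like the above can be extracted and given `-vvv`. A rough sketch, assuming the log wording shown here; the function names are made up for illustration and are not part of Ray, and the last logged command is only the likely culprit, not a certain one.

```python
import re
from typing import Optional

# Matches lines such as:   Full command is `ssh -tt ... (uptime)'`
FULL_COMMAND = re.compile(r"Full command is `(.*)`\s*$")


def last_full_command(log_text: str) -> Optional[str]:
    """Return the last command logged as "Full command is `...`", or None."""
    last = None
    for line in log_text.splitlines():
        match = FULL_COMMAND.search(line)
        if match:
            last = match.group(1)
    return last


def with_debug(command: str, flag: str = "-vvv") -> str:
    """Insert `flag` right after a leading `ssh`; leave other commands unchanged."""
    head, sep, rest = command.partition(" ")
    if head != "ssh" or not sep:
        return command
    return f"ssh {flag} {rest}"


# Shortened stand-in for real `ray up -vvv` output.
log = (
    "    Running `uptime`\n"
    "      Full command is `ssh -tt ubuntu@34.133.193.66 uptime`\n"
    "Shared connection to 34.133.193.66 closed.\n"
    "      Full command is `ssh -tt ubuntu@34.133.193.66 nvidia-smi`\n"
    "  New status: update-failed\n"
)
print(with_debug(last_full_command(log)))
```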

Here is the ssh -vvv rerun. (My previous post would have been over the character limit if it contained this.)

(env) chris_chiasson@penguin:~/te$ ssh -vvv -tt -i /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_8ec3601252/c21f969b5f/%C -o ControlPersist=500s -o ConnectTimeout=120s ubuntu@34.133.193.66 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker run --rm --name ray_nvidia_docker -d -it -v /tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub:/home/ray/.ssh/id_rsa.pub -v /tmp/ray_tmp_mount/default/~/json/t-3-f3bd25cf2f05.json:/home/ray/json/t-3-f3bd25cf2f05.json -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='"'"'17729594572.800003b'"'"' --runtime=nvidia --net=host rayproject/ray-ml:latest-gpu bash)'
OpenSSH_7.9p1 Debian-10+deb10u2, OpenSSL 1.1.1d  10 Sep 2019
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug2: resolve_canonicalize: hostname 34.133.193.66 is address
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ray_ssh_8ec3601252/c21f969b5f/0b507d1b357341a1184a70c0a986aac8e47465be" does not exist
debug2: ssh_connect_direct
debug1: Connecting to 34.133.193.66 [34.133.193.66] port 22.
debug2: fd 3 setting O_NONBLOCK
debug1: fd 3 clearing O_NONBLOCK
debug1: Connection established.
debug3: timeout: 119777 ms remain after connect
debug1: identity file /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem type -1
debug1: identity file /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_7.9p1 Debian-10+deb10u2
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.9p1 Debian-10+deb10u2
debug1: match: OpenSSH_7.9p1 Debian-10+deb10u2 pat OpenSSH* compat 0x04000000
debug2: fd 3 setting O_NONBLOCK
debug1: Authenticating to 34.133.193.66:22 as 'ubuntu'
debug3: hostkeys_foreach: reading file "/dev/null"
debug3: send packet: type 20
debug1: SSH2_MSG_KEXINIT sent
debug3: receive packet: type 20
debug1: SSH2_MSG_KEXINIT received
debug2: local client KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1,ext-info-c
debug2: host key algorithms: ecdsa-sha2-nistp256-cert-v01@openssh.com,ecdsa-sha2-nistp384-cert-v01@openssh.com,ecdsa-sha2-nistp521-cert-v01@openssh.com,ssh-ed25519-cert-v01@openssh.com,rsa-sha2-512-cert-v01@openssh.com,rsa-sha2-256-cert-v01@openssh.com,ssh-rsa-cert-v01@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,ssh-ed25519,rsa-sha2-512,rsa-sha2-256,ssh-rsa
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com,zlib
debug2: compression stoc: none,zlib@openssh.com,zlib
debug2: languages ctos: 
debug2: languages stoc: 
debug2: first_kex_follows 0 
debug2: reserved 0 
debug2: peer server KEXINIT proposal
debug2: KEX algorithms: curve25519-sha256,curve25519-sha256@libssh.org,ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256,diffie-hellman-group16-sha512,diffie-hellman-group18-sha512,diffie-hellman-group14-sha256,diffie-hellman-group14-sha1
debug2: host key algorithms: rsa-sha2-512,rsa-sha2-256,ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519
debug2: ciphers ctos: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: ciphers stoc: chacha20-poly1305@openssh.com,aes128-ctr,aes192-ctr,aes256-ctr,aes128-gcm@openssh.com,aes256-gcm@openssh.com
debug2: MACs ctos: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: MACs stoc: umac-64-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-sha2-512-etm@openssh.com,hmac-sha1-etm@openssh.com,umac-64@openssh.com,umac-128@openssh.com,hmac-sha2-256,hmac-sha2-512,hmac-sha1
debug2: compression ctos: none,zlib@openssh.com
debug2: compression stoc: none,zlib@openssh.com
debug2: languages ctos: 
debug2: languages stoc: 
debug2: first_kex_follows 0 
debug2: reserved 0 
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ecdsa-sha2-nistp256
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug3: send packet: type 30
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug3: receive packet: type 31
debug1: Server host key: ecdsa-sha2-nistp256 SHA256:NDRk+hZ9ylnWHnL4YNdJTAmFJmboHE0wyvh1pS3sEHA
debug3: hostkeys_foreach: reading file "/dev/null"
Warning: Permanently added '34.133.193.66' (ECDSA) to the list of known hosts.
debug3: send packet: type 21
debug2: set_newkeys: mode 1
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: receive packet: type 21
debug1: SSH2_MSG_NEWKEYS received
debug2: set_newkeys: mode 0
debug1: rekey after 134217728 blocks
debug1: Will attempt key: /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem  explicit
debug2: pubkey_prepare: done
debug3: send packet: type 5
debug3: receive packet: type 7
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug3: receive packet: type 6
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug3: send packet: type 50
debug3: receive packet: type 51
debug1: Authentications that can continue: publickey
debug3: start over, passed a different list publickey
debug3: preferred gssapi-keyex,gssapi-with-mic,publickey,keyboard-interactive,password
debug3: authmethod_lookup publickey
debug3: remaining preferred: keyboard-interactive,password
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Trying private key: /home/chris_chiasson/.ssh/ray-autoscaler_gcp_us-central1_t-3_ubuntu_0.pem
debug3: sign_and_send_pubkey: RSA SHA256:WkudYX4AaTrHi2TTs1iwbISXMBXTMla66aMhN4UDUbo
debug3: sign_and_send_pubkey: signing using rsa-sha2-512
debug3: send packet: type 50
debug2: we sent a publickey packet, wait for reply
debug3: receive packet: type 52
debug1: Authentication succeeded (publickey).
Authenticated to 34.133.193.66 ([34.133.193.66]:22).
debug1: setting up multiplex master socket
debug3: muxserver_listen: temporary control path /tmp/ray_ssh_8ec3601252/c21f969b5f/0b507d1b357341a1184a70c0a986aac8e47465be.5GFhiwhDc3s6cMb0
debug2: fd 4 setting O_NONBLOCK
debug3: fd 4 is O_NONBLOCK
debug3: fd 4 is O_NONBLOCK
debug1: channel 0: new [/tmp/ray_ssh_8ec3601252/c21f969b5f/0b507d1b357341a1184a70c0a986aac8e47465be]
debug3: muxserver_listen: mux listener channel 0 fd 4
debug2: fd 3 setting TCP_NODELAY
debug3: ssh_packet_set_tos: set IP_TOS 0x08
debug1: control_persist_detach: backgrounding master process
debug2: control_persist_detach: background process is 5860
debug2: fd 4 setting O_NONBLOCK
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug2: set_control_persist_exit_time: schedule exit in 500 seconds
debug1: multiplexing control connection
debug2: fd 5 setting O_NONBLOCK
debug3: fd 5 is O_NONBLOCK
debug1: channel 1: new [mux-control]
debug3: channel_post_mux_listener: new mux channel 1 fd 5
debug3: mux_master_read_cb: channel 1: hello sent
debug2: set_control_persist_exit_time: cancel scheduled exit
debug3: mux_master_read_cb: channel 1 packet type 0x00000001 len 4
debug2: mux_master_process_hello: channel 1 slave version 4
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_master_read_cb: channel 1 packet type 0x10000004 len 4
debug2: mux_master_process_alive_check: channel 1: alive check
debug3: mux_client_request_alive: done pid = 5862
debug3: mux_client_request_session: session request sent
debug3: mux_master_read_cb: channel 1 packet type 0x10000002 len 527
debug2: mux_master_process_new_session: channel 1: request tty 1, X 0, agent 0, subsys 0, term "xterm-256color", cmd "bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker run --rm --name ray_nvidia_docker -d -it -v /tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub:/home/ray/.ssh/id_rsa.pub -v /tmp/ray_tmp_mount/default/~/json/t-3-f3bd25cf2f05.json:/home/ray/json/t-3-f3bd25cf2f05.json -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='17729594572.800003b' --runtime=nvidia --net=host rayproject/ray-ml:latest-gpu bash)", env 1
debug3: mux_master_process_new_session: got fds stdin 6, stdout 7, stderr 8
debug1: channel 2: new [client-session]
debug2: mux_master_process_new_session: channel_new: 2 linked to control channel 1
debug2: channel 2: send open
debug3: send packet: type 90
debug3: receive packet: type 80
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug3: receive packet: type 4
debug1: Remote: /home/ubuntu/.ssh/authorized_keys:6: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug3: receive packet: type 91
debug2: channel_input_open_confirmation: channel 2: callback start
debug2: client_session2_setup: id 2
debug2: channel 2: request pty-req confirm 1
debug3: send packet: type 98
debug1: Sending environment.
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 2: request env confirm 0
debug3: send packet: type 98
debug1: Sending command: bash --login -c -i true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (docker run --rm --name ray_nvidia_docker -d -it -v /tmp/ray_tmp_mount/default/~/.ssh/id_rsa.pub:/home/ray/.ssh/id_rsa.pub -v /tmp/ray_tmp_mount/default/~/json/t-3-f3bd25cf2f05.json:/home/ray/json/t-3-f3bd25cf2f05.json -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --shm-size='17729594572.800003b' --runtime=nvidia --net=host rayproject/ray-ml:latest-gpu bash)
debug2: channel 2: request exec confirm 1
debug3: send packet: type 98
debug3: mux_session_confirm: sending success reply
debug2: channel_input_open_confirmation: channel 2: callback done
debug2: channel 2: open confirm rwindow 0 rmax 32768
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 2
debug2: PTY allocation request accepted on channel 2
debug2: channel 2: rcvd adjust 2097152
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 2
debug2: exec request accepted on channel 2
3c754f212f2b2a26b7229313dbb0f0abaa5e9ff087e660020ed2742a24a8531e
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 2 rtype exit-status reply 0
debug3: mux_exit_message: channel 2: exit message, exitval 0
debug3: receive packet: type 98
debug1: client_input_channel_req: channel 2 rtype eow@openssh.com reply 0
debug2: channel 2: rcvd eow
debug2: channel 2: chan_shutdown_read (i0 o0 sock -1 wfd 6 efd 8 [write])
debug2: channel 2: input open -> closed
debug3: receive packet: type 96
debug2: channel 2: rcvd eof
debug2: channel 2: output open -> drain
debug2: channel 2: obuf empty
debug2: channel 2: chan_shutdown_write (i3 o1 sock -1 wfd 7 efd 8 [write])
debug2: channel 2: output drain -> closed
debug3: receive packet: type 97
debug2: channel 2: rcvd close
debug3: channel 2: will not send data after close
debug2: channel 2: send close
debug3: send packet: type 97
debug2: channel 2: is dead
debug2: channel 2: gc: notify user
debug3: mux_master_session_cleanup_cb: entering for channel 2
debug2: channel 1: rcvd close
debug2: channel 1: output open -> drain
debug2: channel 1: chan_shutdown_read (i0 o1 sock 5 wfd 5 efd -1 [closed])
debug2: channel 1: input open -> closed
debug2: channel 2: gc: user detached
debug2: channel 2: is dead
debug2: channel 2: garbage collecting
debug1: channel 2: free: client-session, nchannels 3
debug3: channel 2: status: The following connections are open:
  #1 mux-control (t16 nr0 i3/0 o1/16 e[closed]/0 fd 5/5/-1 sock 5 cc -1)
  #2 client-session (t4 r0 i3/0 o3/0 e[write]/0 fd -1/-1/8 sock -1 cc -1)

debug2: channel 1: obuf empty
debug2: channel 1: chan_shutdown_write (i3 o1 sock 5 wfd 5 efd -1 [closed])
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 34.133.193.66 closed.
debug2: channel 1: output drain -> closed
debug2: channel 1: is dead (local)
debug2: channel 1: gc: notify user
debug3: mux_master_control_cleanup_cb: entering for channel 1
debug2: channel 1: gc: user detached
debug2: channel 1: is dead (local)
debug2: channel 1: garbage collecting
debug1: channel 1: free: mux-control, nchannels 2
debug3: channel 1: status: The following connections are open:
  #1 mux-control (t16 nr0 i3/0 o3/0 e[closed]/0 fd 5/5/-1 sock 5 cc -1)

debug2: set_control_persist_exit_time: schedule exit in 500 seconds
(env) chris_chiasson@penguin:~/te$
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
(env) chris_chiasson@penguin:~/te$
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
(env) chris_chiasson@penguin:~/te$
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82
debug3: send packet: type 80
debug3: receive packet: type 82