Cannot start Ray runtime on GCP

Running
ray up cluster.yaml

Getting the following:

[7/7] Starting the Ray runtime
Could not terminate **/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --store_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --object_manager_port=8076 --min_worker_port=10000 --max_worker_port=10999 --node_manager_port=34998 --node_ip_address=10.138.0.2 --redis_address=10.138.0.2 --redis_port=6379 --num_initial_workers=2 --maximum_startup_concurrency=2 --static_resource_list=node:10.138.0.2,1.0,CPU,2,memory,87,object_store_memory,30 --config_list=plasma_store_as_thread,True "--python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.138.0.2 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --redis-address=10.138.0.2:6379 --config-list=plasma_store_as_thread,True --temp-dir=/tmp/ray --metrics-agent-port=56424 --redis-password=5241590000000000" --java_worker_command= --cpp_worker_command= --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322 --metrics-agent-port=56424 --metrics_export_port=55267 "--agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py --node-ip-address=10.138.0.2 --redis-address=10.138.0.2:6379 --metrics-export-port=55267 --dashboard-agent-port=56424 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password=5241590000000000" --object_store_memory=2284963430 --plasma_directory=/dev/shm --head_node** due to 
psutil.AccessDenied (pid=6479, name=‘raylet’)

Could not terminate **/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store -m 2284963430 -d /dev/shm -z** due to psutil.AccessDenied (pid=6478, name=‘plasma_store_server’)

Could not terminate **/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --redis_address=10.138.0.2 --redis_port=6379 --config_list=plasma_store_as_thread,True --gcs_server_port=0 --metrics-agent-port=56424 --node-ip-address=10.138.0.2 --redis_password=5241590000000000** due to psutil.AccessDenied (pid=6457, name=‘gcs_server’)

Could not terminate **/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py --redis-address=10.138.0.2:6379 --logs-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password 5241590000000000** due to psutil.AccessDenied (pid=6480, name=‘/home/ray/anaco’)

Could not terminate **"/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""** due to psutil.AccessDenied (pid=6446, name=‘redis-server’)

Could not terminate **"/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:17690" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""** due to psutil.AccessDenied (pid=6452, name=‘redis-server’)

Could not terminate **/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --store_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --object_manager_port=8076 --min_worker_port=10000 --max_worker_port=10999 --node_manager_port=34998 --node_ip_address=10.138.0.2 --redis_address=10.138.0.2 --redis_port=6379 --num_initial_workers=2 --maximum_startup_concurrency=2 --static_resource_list=node:10.138.0.2,1.0,CPU,2,memory,87,object_store_memory,30 --config_list=plasma_store_as_thread,True "--python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.138.0.2 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --redis-address=10.138.0.2:6379 --config-list=plasma_store_as_thread,True --temp-dir=/tmp/ray --metrics-agent-port=56424 --redis-password=5241590000000000" --java_worker_command= --cpp_worker_command= --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322 --metrics-agent-port=56424 --metrics_export_port=55267 "--agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py --node-ip-address=10.138.0.2 --redis-address=10.138.0.2:6379 --metrics-export-port=55267 --dashboard-agent-port=56424 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password=5241590000000000" --object_store_memory=2284963430 --plasma_directory=/dev/shm --head_node** due to 
psutil.AccessDenied (pid=6479, name=‘raylet’)

Could not terminate **/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/log_monitor.py --redis-address=10.138.0.2:6379 --logs-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password 5241590000000000** due to psutil.AccessDenied (pid=6480, name=‘/home/ray/anaco’)

Could not terminate **/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/dashboard.py --host=localhost --port=8265 --redis-address=10.138.0.2:6379 --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password 5241590000000000** due to psutil.AccessDenied (pid=6477, name=‘/home/ray/anaco’)

Could not terminate **/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --store_socket_name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --object_manager_port=8076 --min_worker_port=10000 --max_worker_port=10999 --node_manager_port=34998 --node_ip_address=10.138.0.2 --redis_address=10.138.0.2 --redis_port=6379 --num_initial_workers=2 --maximum_startup_concurrency=2 --static_resource_list=node:10.138.0.2,1.0,CPU,2,memory,87,object_store_memory,30 --config_list=plasma_store_as_thread,True "--python_worker_command=/home/ray/anaconda3/bin/python /home/ray/anaconda3/lib/python3.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.138.0.2 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --redis-address=10.138.0.2:6379 --config-list=plasma_store_as_thread,True --temp-dir=/tmp/ray --metrics-agent-port=56424 --redis-password=5241590000000000" --java_worker_command= --cpp_worker_command= --redis_password=5241590000000000 --temp_dir=/tmp/ray --session_dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322 --metrics-agent-port=56424 --metrics_export_port=55267 "--agent_command=/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py --node-ip-address=10.138.0.2 --redis-address=10.138.0.2:6379 --metrics-export-port=55267 --dashboard-agent-port=56424 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password=5241590000000000" --object_store_memory=2284963430 --plasma_directory=/dev/shm --head_node** due to 
psutil.AccessDenied (pid=6479, name=‘raylet’)

Could not terminate **/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/new_dashboard/agent.py --node-ip-address=10.138.0.2 --redis-address=10.138.0.2:6379 --metrics-export-port=55267 --dashboard-agent-port=56424 --node-manager-port=34998 --object-store-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/sockets/raylet --temp-dir=/tmp/ray --log-dir=/tmp/ray/session_2021-02-24_16-25-42_086193_4322/logs --redis-password=5241590000000000** due to psutil.AccessDenied (pid=6533, name=‘/home/ray/anaco’)

@rliaw have you seen this issue before?

It’s the first time I’m trying to set up a cluster on GCP, so no, I haven’t.

Oh sorry, I was asking the others, haha (I know he recently answered a GCP-related issue).

Also cc @Ameer_Haj_Ali

@kipnisal always feel free to ping me again if people don’t respond to you!

Can you please paste the cluster YAML?
Does it also happen on the Ray nightly?

CC @ijrsvt, perhaps this is related to the other GCP issue?

Here is the cluster yaml:

cluster_name: default

min_workers: 3
max_workers: 4


provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: slurm-experiment # Globally unique project id

auth:
    ssh_user: ubuntu

setup_commands:
  - sudo apt update
  - sudo apt-get install -y cmake
  - pip install ray[all] numpy pandas TwoSampleHC

Hi @sangcho,
I’m still struggling with this issue.
Any pointers on where to look?

Can you check whether the ubuntu user has access to this folder? '/home/ray/anaco'

I don’t understand the question. What user? How can I check that?
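One way to check this, as a minimal sketch: run the snippet below on the head node as the user in question (for example after SSH-ing in as `ubuntu`). It reports who the current user is and whether that user can read, write, and traverse a given path. The function name `describe_access` is made up for illustration, and `/home/ray/anaconda3` is simply the directory prefix that appears in the log above. If Ray is running inside a Docker container, the path may only exist inside that container.

```python
import getpass
import os


def current_user():
    """Best-effort name of the user running this script."""
    try:
        return getpass.getuser()
    except Exception:
        # Fall back to the numeric uid if no user name can be resolved.
        return str(os.getuid())


def describe_access(path):
    """Report whether the current user can read, write, and traverse `path`."""
    return {
        "user": current_user(),
        "path": path,
        "exists": os.path.exists(path),
        "read": os.access(path, os.R_OK),
        "write": os.access(path, os.W_OK),
        "execute": os.access(path, os.X_OK),
    }


if __name__ == "__main__":
    # Path taken from the log output; adjust it to the directory you care about.
    print(describe_access("/home/ray/anaconda3"))
```

If `exists` comes back `False`, the directory is probably not visible to this user at all (or lives inside a container), which is a different problem from a plain permission error.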

I’m trying to follow the instructions on https://docs.ray.io/en/releases-0.8.5/autoscaling.html

Below is the output of a different attempt. This time I added

docker:
  image: "kipnisal/phase_transition"
  container_name: "phase_transition_exp"
  pull_before_run: True
  run_options: []  # Extra options to pass into "docker run"

to the cluster.yaml file and removed what was under `setup_commands` (the Docker image includes instructions to install Ray and the other requirements).

[6/7] Running setup commands

(0/3) **wget https://repo.anaconda.com** ...

bash: wget: command not found

bash: /root/anaconda3.sh: No such file or directory

rm: cannot remove '/root/anaconda3.sh': No such file or directory

Shared connection to 35.203.137.46 closed.

2021-03-01 19:57:08,101 INFO node_provider.py:20 -- wait_for_compute_zone_operation: Waiting for operation operation-1614657427526-5bc85ba66d52d-91feb0eb-0eabe74f to finish...

2021-03-01 19:57:13,843 INFO node_provider.py:32 -- wait_for_compute_zone_operation: Operation operation-1614657427526-5bc85ba66d52d-91feb0eb-0eabe74f finished.

New status: **update-failed**

!!!

SSH command failed.

!!!

Failed to setup head node.

zsh: exit 1 ray up cluster.yaml

What is unclear to me is why Ray is looking for Anaconda.

I asked because of this line:

If you launch the autoscaler, the node’s user is ubuntu. (Look at your cluster.yaml)

auth:
    ssh_user: ubuntu

And at a glance, it seems like this SSH user doesn’t have access to the folder /home/ray/anaco.
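The `psutil.AccessDenied` entries in the log are raised when the operating system refuses an operation on a process for permission reasons, which typically happens when the process belongs to a different user. As a rough sketch (POSIX only, and not how Ray itself does it), the current user can test whether it is allowed to signal a given pid without actually killing anything. The helper name `can_signal` is made up for illustration.

```python
import os


def can_signal(pid):
    """Return True if the current user may signal `pid`,
    False if the OS denies permission,
    and None if no such process exists."""
    try:
        # Signal 0 sends nothing; it only runs the existence and permission checks.
        os.kill(pid, 0)
    except PermissionError:
        return False
    except ProcessLookupError:
        return None
    return True


if __name__ == "__main__":
    # The current process can always signal itself.
    print(can_signal(os.getpid()))
```

Calling `can_signal(6479)` on the head node (the raylet pid from the log, which will differ on every run) and getting `False` would suggest the Ray processes were started by a different user than the one running `ray stop`.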

Also, what version of Ray are you using? @Dmitri @Ameer_Haj_Ali, can you handle this issue?

This is suboptimal behavior…
It’s happening because setup commands are getting pulled from here:

I’d actually recommend taking a look at this file as a starting point and modifying to suit your use-case:

Also, if you could post an issue on the Ray GitHub with the last error message you saw, that would be very helpful.
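To illustrate why removing everything under `setup_commands` did not stop the Anaconda download: the sketch below assumes the autoscaler fills in any missing top-level key from a defaults file with a simple shallow merge. That is a simplification for illustration, not Ray's actual code, and the default values shown are placeholders rather than the real commands.

```python
def fill_defaults(user_config, defaults):
    """Shallow merge: top-level keys in user_config replace the defaults."""
    merged = dict(defaults)
    merged.update(user_config)
    return merged


# Hypothetical defaults; placeholders standing in for the real default commands.
DEFAULTS = {
    "setup_commands": [
        "<download Anaconda installer>",
        "<run installer>",
        "<install ray>",
    ],
}

# Key omitted entirely: the defaults are used.
without_key = fill_defaults({"cluster_name": "default"}, DEFAULTS)

# Key present but empty: the explicit empty list wins.
with_empty = fill_defaults(
    {"cluster_name": "default", "setup_commands": []}, DEFAULTS
)

print(len(without_key["setup_commands"]))
print(len(with_empty["setup_commands"]))
```

Under that assumption, keeping the key and setting it explicitly to an empty list (`setup_commands: []`) would be one way to avoid inheriting the default commands when the Docker image already contains everything.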