Are you using your own docker image? Perhaps one thing you could do is check and adjust the permissions on /home/ray/ray_bootstrap_config.yaml in the setup_commands of your Ray cluster configuration.
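For example, something along these lines might do it (just a sketch; it assumes the bootstrap config ends up at that path and that the container user can sudo):

setup_commands:
    # Sketch only: take ownership of the autoscaler bootstrap config so the
    # container's ray user can read it. The file only exists on the head node,
    # hence the "|| true" so worker setup doesn't fail.
    - sudo chown ray /home/ray/ray_bootstrap_config.yaml || true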
It would also be really helpful if you could post some snippets of your cluster configuration.
Thanks! I am not using my own docker image, here is my config:
# A unique identifier for the head node and workers of this cluster.
cluster_name: clusty

initial_workers: 5
min_workers: 5
max_workers: 1000

upscaling_speed: 1.0

docker:
    image: "rayproject/ray:latest-cpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

idle_timeout_minutes: 5

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: OUR_ID

auth:
    ssh_user: ubuntu

head_node:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu

worker_nodes:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 10
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    # scheduling:
    #   - preemptible: true

file_mounts: {
    "/home/ray/dist/": "./dist"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip3 install /home/ray/dist/*
    - pip3 install smart_open[gcp] pathy[gcs]

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip3 install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
Hey @Yoav, the following changes should enable the Ray cluster to start workers. Let me know if you run into any issues!
@@ -46,7 +46,7 @@ worker_nodes:
         autoDelete: true
         type: PERSISTENT
         initializeParams:
-          diskSizeGb: 10
+          diskSizeGb: 50  # Requested disk size cannot be smaller than the image size (30 GB)
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
     # Run workers on preemptible instances by default.
@@ -92,6 +92,8 @@ setup_commands:
 # Custom commands that will be run on the head node after common setup.
 head_setup_commands:
+    - sudo chown ray ~/ray_bootstrap_key.pem
+    - sudo chown ray ~/ray_bootstrap_config.yaml
     - pip3 install google-api-python-client==1.7.8
 # Custom commands that will be run on worker nodes after common setup.
Oh. I am now getting this error again, and I do have the fix above in my .yaml file.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 852, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 356, in <module>
redis_password=args.redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 112, in __init__
event_summarizer=self.event_summarizer)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 87, in __init__
self.reset(errors_fatal=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 518, in reset
raise e
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 448, in reset
with open(self.config_path) as f:
PermissionError: [Errno 13] Permission denied: '/home/ray/ray_bootstrap_config.yaml'
==> /tmp/ray/session_latest/logs/monitor.log <==
One change is that I moved to a mostly docker-based setup, but it did work with the docker-based setup for a while before it started failing again.
Not sure how to investigate this further.
What I did notice, though, is the following:
It happens in an inconsistent way: sometimes a cluster fails to create workers, and sometimes it doesn’t.
I think it might be correlated with the number of tasks, i.e., when the first thing I schedule on the cluster is 2000 actors, it sometimes results in the above error. Asking for 200 actors worked OK several times, with no errors. And scheduling 200, having some workers come up, stopping the job, and then running again with 2000 works fine.
Hey @Yoav, it appears that the issue is that the default sourceImage (basically the VM’s host OS) runs as the ubuntu user with UID 1001, while inside the container the running user (ray) has UID 1000. I’ll be working on a solution ASAP!
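If you want to double-check the mismatch on your own cluster, a rough way to see it (just a sketch) is to log the UIDs during setup, since initialization_commands run on the host while setup_commands run inside the container:

initialization_commands:
    # Runs on the host, outside the container: expect uid=1001(ubuntu)
    - id
setup_commands:
    # Runs inside the container: expect uid=1000(ray)
    - id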
great, thanks!
(Actually, I was just using the image from the example file. When I tried to switch to a Google Container-Optimized OS image instead, things broke. It wasn’t that important to me so I switched back to the example image, but this may explain that behavior as well…)
With which version of Ray were you able to make this work? I tried the same YAML config with latest and nightly and only the head node gets created… I’ve been unable to make it work on GCP with the one in master, even with the chown fix provided above…
With the config above I was able to bring up the 5 workers (legacy), but all are still pending if I check the monitor. The SSH command was successful, the pip installs as well, and then nothing…
Hmm, @philippe-boyd-maxa are there any logs that indicate failure or progress? If you could post the output of /tmp/ray/session_latest/monitor* that would be helpful for me to debug.
There’s nothing unusual in the logs; it just keeps repeating the status below, and monitor.out and monitor.err don’t show any errors.
======== Autoscaler status: 2021-04-22 14:59:57.449422 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
Pending:
10.30.30.23: ray_worker, setting-up
10.30.30.30: ray_worker, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.00/2.246 GiB memory
0.0/4.0 CPU
0.00/0.732 GiB object_store_memory
Demands:
(no resource demands)
The problem occurs when using sourceImage: projects/cos-cloud/global/images/family/cos-stable.
Weird as it sounds, if I create a VM on the same network, with the same network tags and all, and just run a redis image on it, I’ll be able to connect to it from the worker. But I cannot connect to the redis server running on the head. (Connection timeout)
If I use another source image such as projects/deeplearning-platform-release/global/images/family/common-cpu, it works.
I’m running a docker container, so anything other than COS is too bloated and unnecessary.
Ok so I found the issue after going down the rabbit hole… When launching a Ray cluster, Ray uses docker to initialize the image on the nodes, even when you use a COS image. Underneath, Ray issues the same gcloud compute instances create for every node instead of calling gcloud compute instances create-with-container, which is a special API for COS images. The benefit of using the latter and setting the image in the API call is that it opens all TCP, UDP, and ICMP ports in iptables, so you’re free to expose any port from your docker image (which is super logical; the rest can be done through GCP’s firewall).
TLDR: when using the COS source image (projects/cos-cloud/global/images/family/cos-stable), only port 22 will be open on your VM, because Ray uses docker to pull the image but doesn’t make any changes to iptables.
philippe_boyd@ray-default-head-3650dc66 ~ $ sudo /sbin/iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere             <-- this is what's missing
If we have the docker section in the Ray cluster’s config, why would we want anything other than a COS image to handle it?
Workaround: for the time being the following works, but it shouldn’t be a permanent solution:
initialization_commands:
    - sudo /sbin/iptables -A INPUT -p tcp -j ACCEPT
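A slightly tighter variant (assuming your nodes sit in a 10.x.x.x VPC range like mine; adjust the CIDR to your own subnet) would be to only accept internal traffic:

initialization_commands:
    # Only accept TCP from the VPC-internal range instead of from everywhere
    - sudo /sbin/iptables -A INPUT -p tcp -s 10.0.0.0/8 -j ACCEPT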
Yeah, that’s a good point; I think we were just never aware of this! This is really interesting and seems like a valuable thing for Ray to do. Would you be open to submitting a pull request against the Ray repository?
Hi @philippe-boyd-maxa. Did you try to install GPU drivers on the workers with COS images? I’ve been having a strange bug: initialization_commands work fine on the head node, but they fail on the workers.
When using the deep learning OS images, the workers just hung on a simple apt install command; the same command works just fine on the head node. I wonder if the workers are launched in a different manner even though they have the same node_config as the head node?
Then I switched to a COS image. Now the problem is that the worker fails to download the driver from a public URL (wget), maybe because the host firewall is blocking outgoing connections. I tried adding a few rules but have been unsuccessful so far.
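To test the firewall theory, I’m considering mirroring the INPUT workaround above for outgoing traffic (just a guess, I haven’t confirmed this is actually the cause):

initialization_commands:
    # Guess: allow outbound TCP/UDP in case the COS host firewall drops it,
    # so wget can reach the public driver URL from the workers.
    - sudo /sbin/iptables -A OUTPUT -p tcp -j ACCEPT
    - sudo /sbin/iptables -A OUTPUT -p udp -j ACCEPT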