Are you using your own docker image? Perhaps one thing you could do is check and adjust the permissions on /home/ray/ray_bootstrap_config.yaml in the setup_commands of your Ray cluster configuration.
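For example, something along these lines might do it (just a sketch; it assumes the bootstrap config ends up at that path and that the container user can sudo):

setup_commands:
    # Sketch only: take ownership of the autoscaler bootstrap config so the
    # container's ray user can read it. The file only exists on the head node,
    # hence the "|| true" so worker setup doesn't fail.
    - sudo chown ray /home/ray/ray_bootstrap_config.yaml || true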
It would also be really helpful if you could post some snippets of your cluster configuration.
Thanks! I am not using my own docker image, here is my config:
# A unique identifier for the head node and workers of this cluster.
cluster_name: clusty

initial_workers: 5
min_workers: 5
max_workers: 1000

upscaling_speed: 1.0

docker:
    image: "rayproject/ray:latest-cpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []  # Extra options to pass into "docker run"

idle_timeout_minutes: 5

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: OUR_ID

auth:
    ssh_user: ubuntu

head_node:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu

worker_nodes:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 10
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    # scheduling:
    #   - preemptible: true

file_mounts: {
    "/home/ray/dist/": "./dist"
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip3 install /home/ray/dist/*
    - pip3 install smart_open[gcp] pathy[gcs]

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip3 install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
Hey @Yoav, the following changes should enable the Ray cluster to start workers. Let me know if you run into any issues!
@@ -46,7 +46,7 @@ worker_nodes:
         autoDelete: true
         type: PERSISTENT
         initializeParams:
-          diskSizeGb: 10
+          diskSizeGb: 50  # Requested disk size cannot be smaller than the image size (30 GB)
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
     # Run workers on preemptible instances by default.
@@ -92,6 +92,8 @@ setup_commands:
 # Custom commands that will be run on the head node after common setup.
 head_setup_commands:
+    - sudo chown ray ~/ray_bootstrap_key.pem
+    - sudo chown ray ~/ray_bootstrap_config.yaml
     - pip3 install google-api-python-client==1.7.8
 # Custom commands that will be run on worker nodes after common setup.
Oh. I am now getting this error again, and I do have the fix above in my .yaml file.
==> /tmp/ray/session_latest/logs/monitor.err <==
Error in sys.excepthook:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 852, in custom_excepthook
worker_id = global_worker.worker_id
AttributeError: 'Worker' object has no attribute 'worker_id'
Original exception was:
Traceback (most recent call last):
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 356, in <module>
redis_password=args.redis_password)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/monitor.py", line 112, in __init__
event_summarizer=self.event_summarizer)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 87, in __init__
self.reset(errors_fatal=True)
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 518, in reset
raise e
File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/autoscaler.py", line 448, in reset
with open(self.config_path) as f:
PermissionError: [Errno 13] Permission denied: '/home/ray/ray_bootstrap_config.yaml'
==> /tmp/ray/session_latest/logs/monitor.log <==
One change is that I moved to a mostly docker-based setup, but it did work with the docker-based setup for a while before it started failing again.
Not sure how to investigate this further.
What I did notice, though, is the following:
It happens in an inconsistent way: sometimes a cluster fails to create workers, and sometimes it doesn’t.
I think it might be correlated with the number of tasks, i.e., when the first thing I schedule on the cluster is 2000 actors, it sometimes results in the above error. Asking for 200 actors worked OK several times, with no errors. And scheduling 200, having some workers come up, stopping the job, and then running again with 2000 works fine.
Hey @Yoav, it appears that the issue is that the default sourceImage (basically the VM’s host OS) runs as the ubuntu user with UID 1001, while inside the container the running user (ray) has UID 1000. I’ll be working on a solution ASAP!
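If you want to double-check the mismatch on your own cluster, a rough way to see it (just a sketch) is to log the UIDs during setup, since initialization_commands run on the host while setup_commands run inside the container:

initialization_commands:
    # Runs on the host, outside the container: expect uid=1001(ubuntu)
    - id
setup_commands:
    # Runs inside the container: expect uid=1000(ray)
    - id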
great, thanks!
(Actually, I was just using the image from the example file. When I tried to switch to a Google Container-Optimized OS image instead, things broke. It wasn’t that important to me so I switched back to the example image, but this may explain that behavior as well…)
With which version of Ray were you able to make this work? I tried the same YAML config with latest and nightly and only the head node gets created… I’ve been unable to make it work on GCP with the one in master, even with the chown fix provided above…
With the config above I was able to bring up the 5 workers (legacy), but all are still pending if I check the monitor. The SSH command was successful, the pip installs as well, and then nothing…
Hmm, @philippe-boyd-maxa are there any logs that indicate failure or progress? If you could post the output of /tmp/ray/session_latest/monitor* that would be helpful for me to debug.
There’s nothing unusual in the logs; it just keeps repeating the status below, and monitor.out and monitor.err don’t show any errors.
======== Autoscaler status: 2021-04-22 14:59:57.449422 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
Pending:
10.30.30.23: ray_worker, setting-up
10.30.30.30: ray_worker, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.00/2.246 GiB memory
0.0/4.0 CPU
0.00/0.732 GiB object_store_memory
Demands:
(no resource demands)
The problem occurs when using sourceImage: projects/cos-cloud/global/images/family/cos-stable.
Weird as it sounds, if I create a VM on the same network, with the same network tags and all, and just run a redis image on it, I’ll be able to connect to it from the worker. But I cannot connect to the redis server running on the head. (Connection timeout)
If I use another source image such as projects/deeplearning-platform-release/global/images/family/common-cpu, it works.
I’m running a docker container, so anything other than COS is too bloated and unnecessary.
Ok so I found the issue after going down the rabbit hole… When launching a Ray cluster, Ray uses docker to initialize the image on the nodes, even when you use a COS image. Underneath, Ray issues the same gcloud compute instances create for every node instead of calling gcloud compute instances create-with-container, which is a special API for COS images. The benefit of using the latter and setting the image in the API call is that it opens all TCP, UDP, and ICMP ports in iptables, so you’re free to expose any port from your docker image (which is super logical; the rest can be done through GCP’s firewall).
TLDR: when using the COS source image (projects/cos-cloud/global/images/family/cos-stable), only port 22 will be open on your VM, because Ray uses docker to pull the image but doesn’t make any changes to iptables.
philippe_boyd@ray-default-head-3650dc66 ~ $ sudo /sbin/iptables -L
Chain INPUT (policy DROP)
target     prot opt source               destination
ACCEPT     all  --  anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     all  --  anywhere             anywhere
ACCEPT     icmp --  anywhere             anywhere
ACCEPT     tcp  --  anywhere             anywhere             tcp dpt:ssh
ACCEPT     tcp  --  anywhere             anywhere             <-- this is what's missing
If we have the docker section in the Ray cluster’s config, why would we want anything other than a COS image to handle it?
Workaround: for the time being the following works, but it shouldn’t be a permanent solution:
initialization_commands:
    - sudo /sbin/iptables -A INPUT -p tcp -j ACCEPT
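A slightly tighter variant (assuming your nodes sit in a 10.x.x.x VPC range like mine; adjust the CIDR to your own subnet) would be to only accept internal traffic:

initialization_commands:
    # Only accept TCP from the VPC-internal range instead of from everywhere
    - sudo /sbin/iptables -A INPUT -p tcp -s 10.0.0.0/8 -j ACCEPT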
Yeah, that’s a good point; I think we were just never aware of this! This is really interesting and seems like a valuable thing for Ray to do. Would you be open to submitting a pull request against the Ray repository?
Hi @philippe-boyd-maxa. Did you try to install GPU drivers on the workers with COS images? I’ve been having a strange bug: initialization_commands work fine on the head node, but they fail on the workers.
When using the deep learning OS images, the workers just hung on a simple apt install command; the same command works just fine on the head node. I wonder if the workers are launched in a different manner even though they have the same node_config as the head node?
Then I switched to a COS image. Now the problem is that the worker fails to download the driver from a public URL (wget), maybe because the host firewall is blocking outgoing connections. I tried adding a few rules but have been unsuccessful so far.
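To test the firewall theory, I’m considering mirroring the INPUT workaround above for outgoing traffic (just a guess, I haven’t confirmed this is actually the cause):

initialization_commands:
    # Guess: allow outbound TCP/UDP in case the COS host firewall drops it,
    # so wget can reach the public driver URL from the workers.
    - sudo /sbin/iptables -A OUTPUT -p tcp -j ACCEPT
    - sudo /sbin/iptables -A OUTPUT -p udp -j ACCEPT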