Right now I’m facing another issue: only one of my three nodes is being set up for use in the cluster. Can you give me a pointer as to why?
I even started the cluster with ray up cluster.yml --min-nodes=2. Here is my cluster.yml:
# A unique identifier for the head node and workers of this cluster.
cluster_name: default
## NOTE: Typically for local clusters, min_workers == max_workers == len(worker_ips).
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
# Typically, min_workers == max_workers == len(worker_ips).
min_workers: 2
# The maximum number of worker nodes to launch in addition to the head node.
# This takes precedence over min_workers.
# Typically, min_workers == max_workers == len(worker_ips).
max_workers: 2
# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes, then the autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0
idle_timeout_minutes: 5
# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled. Assumes Docker is installed.
docker:
image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
# image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
container_name: "ray_container"
# If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
# if no cached version is present.
pull_before_run: True
run_options:
- -e PYTHONPATH=/home/ray/reinforced-learning # Extra options to pass into "docker run"
# Local specific configuration.
provider:
type: local
head_ip: 192.168.195.3
worker_ips: [192.168.195.4, 192.168.195.5]
# Optional when running automatic cluster management on prem. If you use a coordinator server,
# then you can launch multiple autoscaling clusters on the same set of machines, and the coordinator
# will assign individual nodes to clusters as needed.
# coordinator_address: "<host>:<port>"
# How Ray will authenticate with newly launched nodes.
auth:
ssh_user: root
# Optional if an ssh private key is necessary to ssh to the cluster.
# ssh_private_key: ~/.ssh/id_rsa
# Leave this empty.
head_node: {}
# Leave this empty.
worker_nodes: {}
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []
# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: True
# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
- "**/.git"
- "**/.git/**"
# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
- ".gitignore"
# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: []
# List of shell commands to run to set up each node.
setup_commands:
- pip install neptune-client
# Note: if you're developing Ray, you probably want to create a Docker image that
# has your Ray repo pre-cloned. Then, you can replace the pip installs
# below with a git checkout <your_sha> (and possibly a recompile).
# To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
# that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
# - pip install -U "ray[full] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
# Custom commands that will be run on the head node after common setup.
head_setup_commands: []
# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []
# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
- ray stop
- ulimit -c unlimited && ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml --node-ip-address=192.168.195.3 --dashboard-host=192.168.195.3
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379
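To see which nodes actually join, I'm checking roughly like this (a sketch of what I run from my workstation; cluster.yml is the config above, and the worker IPs, SSH user, and container name are the ones from the provider, auth, and docker sections):

# Status as reported by the head node
ray exec cluster.yml 'ray status'
# Check a worker machine directly: is ray_container even running?
ssh root@192.168.195.4 'docker ps'
ssh root@192.168.195.5 'docker ps'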
The nodes that are supposed to start are hanging in the state "uninitialized". This is what the head node log and the autoscaler status show:
2021-03-26 14:51:11,452 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffaf0ba4bd9518a2c62fc2131202000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {0.000000/12.000000 CPU, 37.597656 GiB/37.597656 GiB memory, 12.939453 GiB/12.939453 GiB object_store_memory, 1.000000/1.000000 node:192.168.195.3}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
2021-03-26 14:51:11,872 DEBUG (unknown file):0 -- gc.collect() freed 27 refs in 0.35091638000449166 seconds
(autoscaler +1m26s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +1m26s) Removing 2 nodes of type ray-legacy-worker-node-type (launch failed).
(autoscaler +1m31s) Adding 2 nodes of type ray-legacy-worker-node-type.
======== Autoscaler status: 2021-03-26 23:02:26.988953 ========
Node status
---------------------------------------------------------------
Healthy:
Pending:
192.168.195.4: ray-legacy-worker-node-type, uninitialized
192.168.195.5: ray-legacy-worker-node-type, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.00/67.285 GiB memory
0.00/21.680 GiB object_store_memory
16.0/16.0 CPU
1.0/1.0 GPU
Demands:
{'CPU': 1.0}: 1+ pending tasks/actors
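To dig further I've been tailing the autoscaler logs on the head node, roughly like this (a sketch; /tmp/ray/session_latest/logs is Ray's default log directory, so the exact path may differ on other setups):

# Stream the autoscaler/monitor output from my workstation
ray monitor cluster.yml
# Or grab the recent monitor logs directly on the head node
ray exec cluster.yml 'tail -n 200 /tmp/ray/session_latest/logs/monitor*'

As far as I understand, ray monitor just tails those same files over SSH, so either one should show why the worker launches are being reported as failed.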