Hi,
I’m trying to spin up a Ray cluster on AWS EC2 using a YAML config file. Running `ray up config.yaml` successfully creates two EC2 instances, but when I try to submit a job or check `ray status`, it shows only one node and the following:
Pending:
<ip>: ray.worker.default, uninitialized
with no failures reported. I’ve removed the security group IDs and file paths from the config file below:
```yaml
# An unique identifier for the head node and workers of this cluster.
cluster_name: ray-test

# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 2.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    image: "rayproject/ray-ml:latest-gpu" # gpu You can change this to latest-cpu if you don't need GPU support and want a faster startup
    # image: rayproject/ray-ml:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
    container_name: "ray_container"
    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
    # if no cached version is present.
    pull_before_run: True
    run_options: # Extra options to pass into "docker run"
        - --ulimit nofile=65536:65536

    # Example of running a GPU head with CPU workers
    # head_image: "rayproject/ray-ml:latest-gpu"
    # Allow Ray to automatically detect GPUs

    # worker_image: "rayproject/ray-ml:latest-cpu"
    # worker_run_options: []

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 10

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1a,us-east-1b

auth:
    ssh_user: ubuntu
    ssh_private_key: /path/to/key/.pem

available_node_types:
    ray.head.default:
        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
        #resources: { "CPU": 4, "GPU": 1}
        node_config:
            # IamInstanceProfile:
            #     Name: "ray-autoscaler-v1"
            InstanceType: p3.2xlarge #g4dn.2xlarge p3.2xlarge
            ImageId: ami-029510cec6d69f121 #ami-029510cec6d69f121 # Deep Learning AMI (Ubuntu) Version 30
            KeyName: <key-name>
            SecurityGroupIds: [sg1, sg2, sg3] #See above for group IDS
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 200 #100GB
    ray.worker.default:
        min_workers: 1
        max_workers: 2
        resources: {}
        node_config:
            # IamInstanceProfile:
            #     Name: "ray-autoscaler-v1"
            InstanceType: p3.2xlarge #g4dn.2xlarge p3.2xlarge
            ImageId: ami-029510cec6d69f121 #ami-029510cec6d69f121 # Deep Learning AMI (Ubuntu) Version 30
            KeyName: <key-name>
            #InstanceMarketOptions:
            #    MarketType: spot
            SecurityGroupIds: [<sg1>]

head_node_type: ray.head.default

file_mounts: {
#    "/path2/on/remote/machine": "/path2/on/local/machine", #/home/ray
}

cluster_synced_files: []

file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude:
    - "**/.git"
    - "**/.git/**"

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter:
    - ".gitignore"

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
    - pip install -U ninja
    - pip install -U lpips
    - pip install tblib
    # Note: if you're developing Ray, you probably want to create a Docker image that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

head_node: {}
worker_nodes: {}
```