I am trying to create a Ray cluster on AWS with the Hydra Ray AWS launcher (http://hydra.cc/docs/plugins/ray_launcher). The head node is configured as m4.large and the worker nodes as p3.2xlarge with market type spot. The remote resources are specified to require 1 GPU.
However, the cluster hangs while the autoscaler tries to create the worker nodes.
Any suggestions?
Here is my config. I omitted KeyName, SecurityGroupIds, SubnetIds, and IamInstanceProfile from this printout; the head node and the worker nodes use the same values for these keys.
# @package hydra.launcher
_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
env_setup:
  pip_packages:
    omegaconf: ${ray_pkg_version:omegaconf}
    hydra_core: ${ray_pkg_version:hydra}
    ray: ${ray_pkg_version:ray}
    cloudpickle: ${ray_pkg_version:cloudpickle}
    pickle5: 0.0.11
    hydra_ray_launcher: 1.1.0
    aioredis: 1.3.1
    hydra-optuna-sweeper: 1.1.0
  commands:
  - echo 'conda activate proj' >> ~/.bashrc
  - conda activate proj
  - pip install --upgrade pip
ray:
  init:
    address: auto
  remote:
    num_gpus: 1
  cluster:
    cluster_name: default
    min_workers: 2
    max_workers: 3
    initial_workers: 2
    autoscaling_mode: default
    target_utilization_fraction: 0.8
    idle_timeout_minutes: 5
    docker:
      image: ''
      container_name: ''
      pull_before_run: true
      run_options: []
    provider:
      type: aws
      region: us-west-2
      availability_zone: us-west-2a,us-west-2b
      cache_stopped_nodes: false
      key_pair:
        key_name: hydra-${oc.env:USER,user}
    auth:
      ssh_user: ubuntu
    head_node:
      InstanceType: m4.large
      ImageId: ami-032c7386322a72480
    worker_nodes:
      InstanceType: p3.2xlarge
      ImageId: ami-032c7386322a72480
      InstanceMarketOptions:
        MarketType: spot
    file_mounts: {}
    initialization_commands: []
    setup_commands: []
    head_setup_commands: []
    worker_setup_commands: []
    head_start_ray_commands:
    - ray stop
    - ulimit -n 65536;ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
    worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
  run_env: auto
stop_cluster: true
sync_up:
  source_dir: .
  target_dir: null
  include:
  - '*'
  exclude:
  - '*'
sync_down:
  source_dir: null
  target_dir: null
  include: []
  exclude:
  - '*'
logging:
  log_style: auto
  color_mode: auto
  verbosity: 0
create_update_cluster:
  no_restart: false
  restart_only: false
  no_config_cache: false
teardown_cluster:
  workers_only: false
  keep_min_workers: false
ray status returns:
======== Autoscaler status: 2021-08-26 21:33:00.494653 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray-legacy-head-node-type
Pending:
 ray-legacy-worker-node-type, 2 launching
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/2.0 CPU
 0.00/4.498 GiB memory
 0.00/2.249 GiB object_store_memory
Demands:
 (no resource demands)
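In case it helps with triage, here is a minimal sketch that summarises the status text above. The helper `summarize_status` is hypothetical (it is not part of Ray or Hydra) and assumes the section layout shown in this particular output. It only confirms what the output already suggests: two workers are still launching and no GPU resource has been registered.

```python
import re

# Relevant part of the `ray status` output pasted above.
STATUS = """\
Healthy:
 1 ray-legacy-head-node-type
Pending:
 ray-legacy-worker-node-type, 2 launching
Recent failures:
 (no failures)
Usage:
 0.0/2.0 CPU
 0.00/4.498 GiB memory
 0.00/2.249 GiB object_store_memory
Demands:
 (no resource demands)
"""


def summarize_status(text):
    """Hypothetical helper: return (pending launches per node type, resource names)."""
    pending = {}
    resources = []
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # Section headers such as "Pending:" or "Usage:".
            section = line[:-1]
            continue
        if section == "Pending":
            match = re.match(r"(\S+), (\d+) launching", line)
            if match:
                pending[match.group(1)] = int(match.group(2))
        elif section == "Usage":
            # Usage lines look like "0.0/2.0 CPU" or "0.00/4.498 GiB memory".
            resources.append(line.split()[-1])
    return pending, resources


pending, resources = summarize_status(STATUS)
print("pending launches:", pending)
print("registered resources:", resources)
print("GPU registered:", "GPU" in resources)
```

So the head node is the only node contributing resources, and the tasks that require 1 GPU have nowhere to run until a worker actually joins.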
The autoscaler hangs:
(autoscaler +14m1s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m7s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m12s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m18s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m23s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m28s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m34s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m39s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m44s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m49s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m55s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m0s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m5s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m11s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m16s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m22s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m27s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m33s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m38s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m44s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m49s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m54s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m59s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m5s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m10s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m15s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m21s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m31s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m37s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m42s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m47s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m53s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m58s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m3s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m9s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m14s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m19s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m25s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m30s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m35s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m41s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m46s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m51s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m57s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m2s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m7s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m13s) Adding 2 nodes of type ray-legacy-worker-node-type.
...
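To put a number on how long the autoscaler keeps repeating itself, here is a rough sketch that converts the `(autoscaler +XmYs)` prefixes into seconds. The helper `elapsed_seconds` is hypothetical, and the prefix format is assumed from the lines above only (minutes and seconds; I have not checked how Ray prints longer durations).

```python
import re

# Assumed prefix format, taken from the log lines above: "(autoscaler +14m1s)".
PREFIX = re.compile(r"\(autoscaler \+(?:(\d+)m)?(\d+)s\)")


def elapsed_seconds(line):
    """Hypothetical helper: seconds since start encoded in an autoscaler log line."""
    match = PREFIX.match(line)
    if match is None:
        return None
    minutes = int(match.group(1)) if match.group(1) else 0
    return minutes * 60 + int(match.group(2))


# First and last lines of the excerpt above.
first = "(autoscaler +14m1s) Adding 2 nodes of type ray-legacy-worker-node-type."
last = "(autoscaler +18m13s) Adding 2 nodes of type ray-legacy-worker-node-type."

stuck_for = elapsed_seconds(last) - elapsed_seconds(first)
print(f"same message repeated for {stuck_for} seconds in this excerpt")
```

In the excerpt alone the message repeats roughly every 5 seconds for more than 4 minutes, and the log continues in the same way after that.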