I am trying to create a Ray cluster on AWS with the Hydra Ray AWS launcher (http://hydra.cc/docs/plugins/ray_launcher). The head node is configured as m4.large and the worker nodes as p3.2xlarge with market type spot. The remote resources are specified to require 1 GPU.
However, the cluster hangs as it tries to create the worker nodes with the autoscaler.
Any suggestions?
Here is my config. I omitted KeyName, SecurityGroupIds, SubnetIds, and IamInstanceProfile from this printout; the head node and the worker nodes use the same values for these keys.
# @package hydra.launcher
_target_: hydra_plugins.hydra_ray_launcher.ray_aws_launcher.RayAWSLauncher
env_setup:
  pip_packages:
    omegaconf: ${ray_pkg_version:omegaconf}
    hydra_core: ${ray_pkg_version:hydra}
    ray: ${ray_pkg_version:ray}
    cloudpickle: ${ray_pkg_version:cloudpickle}
    pickle5: 0.0.11
    hydra_ray_launcher: 1.1.0
    aioredis: 1.3.1
    hydra-optuna-sweeper: 1.1.0
  commands:
  - echo 'conda activate proj' >> ~/.bashrc
  - conda activate proj
  - pip install --upgrade pip
ray:
  init:
    address: auto
  remote:
    num_gpus: 1
  cluster:
    cluster_name: default
    min_workers: 2
    max_workers: 3
    initial_workers: 2
    autoscaling_mode: default
    target_utilization_fraction: 0.8
    idle_timeout_minutes: 5
    docker:
      image: ''
      container_name: ''
      pull_before_run: true
      run_options: []
    provider:
      type: aws
      region: us-west-2
      availability_zone: us-west-2a,us-west-2b
      cache_stopped_nodes: false
      key_pair:
        key_name: hydra-${oc.env:USER,user}
    auth:
      ssh_user: ubuntu
    head_node:
      InstanceType: m4.large
      ImageId: ami-032c7386322a72480
    worker_nodes:
      InstanceType: p3.2xlarge
      ImageId: ami-032c7386322a72480
      InstanceMarketOptions:
        MarketType: spot
    file_mounts: {}
    initialization_commands: []
    setup_commands: []
    head_setup_commands: []
    worker_setup_commands: []
    head_start_ray_commands:
    - ray stop
    - ulimit -n 65536;ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
    worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
  run_env: auto
stop_cluster: true
sync_up:
  source_dir: .
  target_dir: null
  include:
  - '*'
  exclude:
  - '*'
sync_down:
  source_dir: null
  target_dir: null
  include: []
  exclude:
  - '*'
logging:
  log_style: auto
  color_mode: auto
  verbosity: 0
create_update_cluster:
  no_restart: false
  restart_only: false
  no_config_cache: false
teardown_cluster:
  workers_only: false
  keep_min_workers: false
ray status returns:
======== Autoscaler status: 2021-08-26 21:33:00.494653 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray-legacy-head-node-type
Pending:
ray-legacy-worker-node-type, 2 launching
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/2.0 CPU
0.00/4.498 GiB memory
0.00/2.249 GiB object_store_memory
Demands:
(no resource demands)
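To make the status output above easier to reason about, here is a minimal Python sketch that parses the Usage lines and checks whether any GPU resource has been registered. The helper name parse_usage and the regular expression are my own illustration, not part of Ray; the input is copied from the output above. Only 2 CPUs and no GPU show up, which is consistent with the m4.large head node being the only node that has joined the cluster.

```python
import re

# Usage lines copied verbatim from the `ray status` output above.
STATUS_USAGE = """\
0.0/2.0 CPU
0.00/4.498 GiB memory
0.00/2.249 GiB object_store_memory
"""


def parse_usage(text):
    """Return {resource_name: (used, total)} for each usage line.

    Hypothetical helper: the pattern only covers the line shapes seen
    in the excerpt above ("used/total NAME" or "used/total GiB NAME").
    """
    usage = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d.]+)/([\d.]+)\s+(?:GiB\s+)?(\S+)", line)
        if match:
            used, total, name = match.groups()
            usage[name] = (float(used), float(total))
    return usage


usage = parse_usage(STATUS_USAGE)
has_gpu = "GPU" in usage

print(usage)
print("GPU registered:", has_gpu)
```

If a worker had joined, I would expect a GPU entry to appear in this list, so tasks with num_gpus: 1 cannot be scheduled in the state shown.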
The autoscaler hangs, repeating the same message every few seconds:
(autoscaler +14m1s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m7s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m12s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m18s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m23s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m28s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m34s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m39s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m44s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m49s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +14m55s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m0s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m5s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m11s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m16s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m22s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m27s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m33s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m38s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m44s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m49s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m54s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +15m59s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m5s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m10s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m15s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m21s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m26s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m31s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m37s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m42s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m47s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m53s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +16m58s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m3s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m9s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m14s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m19s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m25s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m30s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m35s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m41s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m46s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m51s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +17m57s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m2s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m7s) Adding 2 nodes of type ray-legacy-worker-node-type.
(autoscaler +18m13s) Adding 2 nodes of type ray-legacy-worker-node-type.
...
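As a rough way to quantify the loop above, this Python sketch converts the (autoscaler +XmYs) prefixes into seconds and reports how long the message has been repeating. The helper offset_seconds and its regular expression are assumptions based only on the line format visible in this excerpt; the three sample lines are the first two and the last one from the log above.

```python
import re

# First two and last line of the autoscaler excerpt above.
LOG_LINES = [
    "(autoscaler +14m1s) Adding 2 nodes of type ray-legacy-worker-node-type.",
    "(autoscaler +14m7s) Adding 2 nodes of type ray-legacy-worker-node-type.",
    "(autoscaler +18m13s) Adding 2 nodes of type ray-legacy-worker-node-type.",
]


def offset_seconds(line):
    """Parse the '+XmYs' offset of an autoscaler line into seconds.

    Hypothetical helper: handles only the minutes/seconds form seen
    in the excerpt and returns None for lines that do not match.
    """
    match = re.match(r"\(autoscaler \+(?:(\d+)m)?(\d+)s\)", line)
    if match is None:
        return None
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


offsets = [offset_seconds(line) for line in LOG_LINES]
first_gap = offsets[1] - offsets[0]   # gap between consecutive messages
total_span = offsets[-1] - offsets[0]  # how long the excerpt spans

print("seconds between first two messages:", first_gap)
print("seconds covered by the excerpt:", total_span)
```

So the same launch request is re-logged roughly every 5 to 6 seconds for more than four minutes in this excerpt alone, without any worker ever leaving the launching state.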