Only head node started, not worker nodes

hokyjack · December 28, 2021, 9:48pm

Hello, I have a problem starting an autoscaling cluster on AWS.
I start the cluster with my YAML file, but only the head node comes up and no worker nodes are started. What am I doing wrong?

Ray version 1.9.1 on WSL.
Python 3.7

# A unique identifier for the head node and workers of this cluster.
cluster_name: basic-ray3
upscaling_speed: 5.0

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
max_workers: 12

#idle_timeout_minutes: 5

available_node_types:
  ray.head.default:
    resources: {"CPU": 4}
    node_config:
      InstanceType: m5.xlarge
      KeyName: hoky-ray
      #ImageId: latest_dlami

  ray.worker.default:
    resources: {"CPU": 4}
    min_workers: 12
    max_workers: 12
    node_config:
      InstanceType: m5.xlarge
      KeyName: hoky-ray
      #ImageId: latest_dlami

# Cloud-provider specific configuration.
provider:
   type: aws
   region: us-west-2
   availability_zone: us-west-2a

# How Ray will authenticate with newly launched nodes.
auth:
   ssh_user: ubuntu
   ssh_private_key: ~/.ssh/hoky-ray.pem

setup_commands:
  # We won't use pytorch. However, this and the following line demonstrate
  # that you can specify arbitrary startup scripts on the cluster.
  - pip install ray[all]
  - pip install empyrical pandas==1.2.3 tqdm


file_mounts: {
   "~": ".", # /mnt/c/Users/Hoky/ray
}

head_node_type: ray.head.default
worker_default_node_type: ray.worker.default

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

hoky@DESKTOP-P1L6T2J:/mnt/c/Users/Hoky/ray$ ray up cluster.yaml -y && ray submit cluster.yaml compute_remote.py
Cluster: basic-ray3

2021-12-28 22:28:40,196 INFO util.py:282 -- setting max workers for head node type to 0
Checking AWS environment settings
AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): hoky-ray
  VPC Subnets (all available node types): subnet-071195b1ea481cc82 [default]
  EC2 Security groups (all available node types): sg-030fd1153af5ebf9d [default]
  EC2 AMI (all available node types): ami-0a2363a9cff180a64 [dlami]

No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Acquiring an up-to-date head node
  Launched 1 nodes [subnet_id=subnet-071195b1ea481cc82]
    Launched instance i-00186169f828c3616 [state=pending, info=pending]
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: 52.40.45.139

Thanks for the tips!

ckw017 · January 19, 2022, 1:04am

A different user had a similar issue, with the cluster failing to create worker nodes; the solution there was to update the service account roles. Can you check whether you're seeing similar output from the scheduler?
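
In case it helps narrow things down, here is a minimal sketch for ruling out inconsistent worker counts in the config before looking at permissions. It assumes the cluster YAML has already been loaded into a plain dict (for example with `yaml.safe_load`); `check_cluster_config` is a hypothetical helper, not part of Ray, and the fallback of a node type's `max_workers` to the global value is an assumption.

```python
from typing import Any, Dict, List


def check_cluster_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in the worker counts.

    Hypothetical helper for illustration only; it is not a Ray API.
    """
    problems: List[str] = []
    node_types = config.get("available_node_types", {})
    head_type = config.get("head_node_type")

    # The head node type must be one of the declared node types.
    if head_type not in node_types:
        problems.append(
            f"head_node_type {head_type!r} is not in available_node_types"
        )

    global_max = config.get("max_workers", 0)
    total_min = 0
    for name, spec in node_types.items():
        if name == head_type:
            continue  # the head node is not counted as a worker
        min_w = spec.get("min_workers", 0)
        # Assumption: a missing per-type max_workers falls back to the global one.
        max_w = spec.get("max_workers", global_max)
        total_min += min_w
        if min_w > max_w:
            problems.append(
                f"{name}: min_workers ({min_w}) exceeds its max_workers ({max_w})"
            )

    # The global max_workers caps the total number of workers.
    if total_min > global_max:
        problems.append(
            f"sum of min_workers ({total_min}) exceeds global max_workers ({global_max})"
        )
    return problems


# Worker counts copied from the config posted above.
config = {
    "max_workers": 12,
    "head_node_type": "ray.head.default",
    "available_node_types": {
        "ray.head.default": {"resources": {"CPU": 4}},
        "ray.worker.default": {
            "resources": {"CPU": 4},
            "min_workers": 12,
            "max_workers": 12,
        },
    },
}

print(check_cluster_config(config))  # prints []
```

The posted config passes these checks, so the worker counts themselves look consistent, which points back at something outside the YAML, such as permissions.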
