Cannot initialize worker nodes on AWS

Hello, I need your help.
I tried to launch a Ray cluster on AWS. I used my administrator account, IAM role, and PEM key, and here is my ray-cluster.yaml:

cluster_name: "my-cluster-name"
min_workers: 4
max_workers: 4
upscaling_speed: 1.0
idle_timeout_minutes: 5

docker:
    image: rayproject/ray-ml:f67ff3-py38-cu112
    container_name: "ray_container"
    pull_before_run: True
    run_options:
      - --ulimit nofile=65536:65536

provider:
    type: aws
    region: ap-northeast-2
    availability_zone: ap-northeast-2a, ap-northeast-2b
    cache_stopped_nodes: True
    security_group:
      GroupName: "my-sg-name"

auth:
    ssh_user: ubuntu
    ssh_private_key: <path-to-my-pem-key>  # this key can access all AWS resources

available_node_types:
  ray.head.default:
    resources: {"CPU": 8, "GPU": 1}
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0047595ba1dead337  # official deep learning ami
      KeyName: "<my-pem-key-name>"
  ray.worker.default:
    resources: {"CPU": 4, "GPU": 1}
    min_workers: 2
    max_workers: 4
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0047595ba1dead337
      InstanceMarketOptions:
        MarketType: spot
      KeyName: "<my-pem-key-name>"


head_node_type: ray.head.default

head_setup_commands:
    - pip install kmeanstf
    - pip install opencv-python==4.5.1.48
    - sudo apt-get install htop -y
    - sudo apt-get install vim -y
    - export CUDA_VISIBLE_DEVICES=0
    - sudo chown ray ~/ray_bootstrap_key.pem
    - sudo chown ray ~/ray_bootstrap_config.yaml

worker_setup_commands:
    - pip install kmeanstf
    - pip install opencv-python==4.5.1.48
    - sudo apt-get install htop -y
    - sudo apt-get install vim -y
    - export CUDA_VISIBLE_DEVICES=0

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

#initialization_commands: []
#setup_commands: []
file_mounts: {
  "/home/ray/image_urls.txt": "/Users/username/workspace/ray/image_urls.txt",
  "/home/ray/color_cluster.py": "/Users/username/workspace/ray/color_cluster.py",
  "/home/ray/image_loader.py": "/Users/username/workspace/ray/image_loader.py",
}

Then I ran the following command and got this output:

ray up ray-cluster.yaml

>> AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): <my-pem-key-name>
  VPC Subnets (all available node types): subnet-hash [default]
  EC2 Security groups (all available node types): sg-hash [default]
  EC2 AMI (all available node types): ami-0047595ba1dead337

No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
  Reusing nodes i-09378351189659931. To disable reuse, set `cache_stopped_nodes: False` under `provider` in the cluster configuration.
  Stopping instances to reuse
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: <ec2-ip>
ssh: connect to host  <ec2-ip> port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host  <ec2-ip> port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added ' <ec2-ip>' (ECDSA) to the list of known hosts.
 11:50:54 up 0 min,  1 user,  load average: 0.15, 0.03, 0.01
Shared connection to  <ec2-ip> closed.
    Success.
  Updating cluster configuration. [hash=99158fb606dc6f48ffa22a505f07b16055191a9d]

...

Obviously, I specified the minimum and maximum number of worker nodes in my ray-cluster.yaml, but I only got the head node.

Although I've already tried almost all of the suggested solutions and tips, my cluster still contains only the head node.
How can I fix this issue?

Thanks in advance.

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.
  • High: It blocks me to complete my task.

It could be either that there are no spot instances available or the worker setup commands are failing. Can you see if there are any errors related to starting workers in the autoscaler logs found in /tmp/ray/session_latest/logs/monitor.* on the head node?
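If it helps, you don't need to SSH in by hand: the Ray CLI can run commands on the head node for you. A sketch, assuming `ray-cluster.yaml` is the same config file you passed to `ray up`:

```shell
# Tail the autoscaler monitor logs on the head node via the Ray CLI.
ray exec ray-cluster.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.out'
ray exec ray-cluster.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.err'

# Or open an interactive shell on the head node and inspect directly:
ray attach ray-cluster.yaml
```

Errors about failed instance launches (e.g. permission or spot-capacity problems) usually show up in `monitor.err`.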

Thanks for your help.

First of all, spot instances weren't the cause in this case. Following your advice, I checked the logs in /tmp/ray/session_latest/logs/monitor.* and was able to confirm that the problem was the AWS role and policy.
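For anyone who hits the same symptom: one way to rule out the autoscaler's default role is to attach an instance profile with sufficient EC2 permissions explicitly in `node_config` (Ray passes these fields straight through to the EC2 API). The ARN below is a placeholder, not my real value, so treat this as a sketch rather than the exact fix:

```yaml
available_node_types:
  ray.head.default:
    node_config:
      # Passed through to EC2 RunInstances; the profile must grant the
      # head node permission to launch and tag worker instances.
      IamInstanceProfile:
        Arn: arn:aws:iam::<account-id>:instance-profile/<profile-name>
```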

These discussions worked for me.

thank you.