Cannot initialize worker nodes on AWS

Hello, I need your help.
I tried to launch a Ray cluster on AWS. I used my administrator account, IAM role, and PEM key, and here is my ray-cluster.yaml:

cluster_name: "my-cluster-name"
min_workers: 4
max_workers: 4
upscaling_speed: 1.0
idle_timeout_minutes: 5

docker:
    image: rayproject/ray-ml:f67ff3-py38-cu112
    container_name: "ray_container"
    pull_before_run: True
    run_options:
      - --ulimit nofile=65536:65536

provider:
    type: aws
    region: ap-northeast-2
    availability_zone: ap-northeast-2a, ap-northeast-2b
    cache_stopped_nodes: True
    security_group:
      GroupName: "my-sg-name"

auth:
    ssh_user: ubuntu
    ssh_private_key: <path-to-my-pem-key>  # this key can access all AWS resources

available_node_types:
  ray.head.default:
    resources: {"CPU": 8, "GPU": 1}
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0047595ba1dead337  # official deep learning ami
      KeyName: "<my-pem-key-name>"
  ray.worker.default:
    resources: {"CPU": 4, "GPU": 1}
    min_workers: 2
    max_workers: 4
    node_config:
      InstanceType: g4dn.xlarge
      ImageId: ami-0047595ba1dead337
      InstanceMarketOptions:
        MarketType: spot
      KeyName: "<my-pem-key-name>"


head_node_type: ray.head.default

head_setup_commands:
    - pip install kmeanstf
    - pip install opencv-python==4.5.1.48
    - sudo apt-get install htop -y
    - sudo apt-get install vim -y
    - export CUDA_VISIBLE_DEVICES=0
    - sudo chown ray ~/ray_bootstrap_key.pem
    - sudo chown ray ~/ray_bootstrap_config.yaml

worker_setup_commands:
    - pip install kmeanstf
    - pip install opencv-python==4.5.1.48
    - sudo apt-get install htop -y
    - sudo apt-get install vim -y
    - export CUDA_VISIBLE_DEVICES=0

head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076

#initialization_commands: []
#setup_commands: []
file_mounts: {
  "/home/ray/image_urls.txt": "/Users/username/workspace/ray/image_urls.txt",
  "/home/ray/color_cluster.py": "/Users/username/workspace/ray/color_cluster.py",
  "/home/ray/image_loader.py": "/Users/username/workspace/ray/image_loader.py",
}

Then I ran the following command and got this output:

ray up ray-cluster.yaml

>> AWS config
  IAM Profile: ray-autoscaler-v1 [default]
  EC2 Key pair (all available node types): <my-pem-key-name>
  VPC Subnets (all available node types): subnet-hash [default]
  EC2 Security groups (all available node types): sg-hash [default]
  EC2 AMI (all available node types): ami-0047595ba1dead337

No head node found. Launching a new cluster. Confirm [y/N]: y

Acquiring an up-to-date head node
  Reusing nodes i-09378351189659931. To disable reuse, set `cache_stopped_nodes: False` under `provider` in the cluster configuration.
  Stopping instances to reuse
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Waiting for IP
      Not yet available, retrying in 5 seconds
      Received: <ec2-ip>
ssh: connect to host  <ec2-ip> port 22: Operation timed out
    SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host  <ec2-ip> port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added ' <ec2-ip>' (ECDSA) to the list of known hosts.
 11:50:54 up 0 min,  1 user,  load average: 0.15, 0.03, 0.01
Shared connection to  <ec2-ip> closed.
    Success.
  Updating cluster configuration. [hash=99158fb606dc6f48ffa22a505f07b16055191a9d]

...

Obviously, I specified the minimum and maximum number of worker nodes in my ray-cluster.yaml, but I only got the head node.

Although I've already tried almost all of the suggested solutions and tips, my cluster still contains only the head node.
How can I fix this issue?

Thanks in advance.

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.
  • High: It blocks me to complete my task.

It could be either that there are no spot instances available or the worker setup commands are failing. Can you see if there are any errors related to starting workers in the autoscaler logs found in /tmp/ray/session_latest/logs/monitor.* on the head node?
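If it helps, you don't need to SSH in by hand: the Ray CLI can run commands on the head node for you. A sketch, assuming `ray-cluster.yaml` is the same config file you passed to `ray up`:

```shell
# Tail the autoscaler monitor logs on the head node via the Ray CLI.
ray exec ray-cluster.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.out'
ray exec ray-cluster.yaml 'tail -n 200 /tmp/ray/session_latest/logs/monitor.err'

# Or open an interactive shell on the head node and inspect directly:
ray attach ray-cluster.yaml
```

Errors about failed instance launches (e.g. permission or spot-capacity problems) usually show up in `monitor.err`.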

Thanks for your help.

First of all, spot instances weren't the cause in this case. Following your advice, I checked the logs in /tmp/ray/session_latest/logs/monitor.* and was able to confirm that the problem was the AWS role and policy.
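For anyone who hits the same symptom: one way to rule out the autoscaler's default role is to attach an instance profile with sufficient EC2 permissions explicitly in `node_config` (Ray passes these fields straight through to the EC2 API). The ARN below is a placeholder, not my real value, so treat this as a sketch rather than the exact fix:

```yaml
available_node_types:
  ray.head.default:
    node_config:
      # Passed through to EC2 RunInstances; the profile must grant the
      # head node permission to launch and tag worker instances.
      IamInstanceProfile:
        Arn: arn:aws:iam::<account-id>:instance-profile/<profile-name>
```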

These discussions worked for me.

thank you.