Hello, I need your help.
I tried to launch a Ray cluster on AWS using my administrator account, IAM role, and PEM key. Here is my ray-cluster.yaml:
Then I ran the following command and got this output:
ray up ray-cluster.yaml
>> AWS config
IAM Profile: ray-autoscaler-v1 [default]
EC2 Key pair (all available node types): <my-pem-key-name>
VPC Subnets (all available node types): subnet-hash [default]
EC2 Security groups (all available node types): sg-hash [default]
EC2 AMI (all available node types): ami-0047595ba1dead337
No head node found. Launching a new cluster. Confirm [y/N]: y
Acquiring an up-to-date head node
Reusing nodes i-09378351189659931. To disable reuse, set `cache_stopped_nodes: False` under `provider` in the cluster configuration.
Stopping instances to reuse
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 5 seconds
Received: <ec2-ip>
ssh: connect to host <ec2-ip> port 22: Operation timed out
SSH still not available (SSH command failed.), retrying in 5 seconds.
ssh: connect to host <ec2-ip> port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added ' <ec2-ip>' (ECDSA) to the list of known hosts.
11:50:54 up 0 min, 1 user, load average: 0.15, 0.03, 0.01
Shared connection to <ec2-ip> closed.
Success.
Updating cluster configuration. [hash=99158fb606dc6f48ffa22a505f07b16055191a9d]
...
Of course, I set the minimum and maximum number of worker nodes in my ray-cluster.yaml, but I only got the head node.
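For reference, the worker counts in my file are set roughly like this (a minimal sketch; the node-type name, instance type, and counts are placeholders, not my exact config):

# Sketch only: names and values below are placeholders.
available_node_types:
  ray.worker.default:
    min_workers: 2      # keep at least this many workers running
    max_workers: 4      # never launch more than this many workers of this type
    resources: {}
    node_config:
      InstanceType: m5.large
      ImageId: ami-0047595ba1dead337   # same AMI as the head node above
# The top-level max_workers in the file also caps the total across all node types.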
Although I've already tried almost all of the suggested solutions and tips, my cluster still contains only the head node…
How can I fix this issue?
Thank you for your help.
How severe does this issue affect your experience of using Ray?
None: Just asking a question out of curiosity
Low: It annoys or frustrates me for a moment.
Medium: It contributes to significant difficulty to complete my task, but I can work around it.
It could be either that there are no spot instances available or that the worker setup commands are failing. Can you see if there are any errors related to starting workers in the autoscaler logs found in /tmp/ray/session_latest/logs/monitor.* on the head node?
First of all, spot instance availability wasn't the cause in this case. Following your advice, I checked the logs in /tmp/ray/session_latest/logs/monitor.*, and I was able to confirm that the problem was caused by the AWS role and policy.
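In case it helps someone else: the autoscaler on the head node calls the EC2 API to launch the workers, so the instance profile attached to the head node needs EC2 permissions. The relevant part of the config looks roughly like this (a sketch; the account ID in the ARN is a placeholder):

available_node_types:
  ray.head.default:
    node_config:
      # The head node's role must allow the EC2 actions the autoscaler uses
      # (e.g. ec2:RunInstances, ec2:TerminateInstances). Ray's default
      # ray-autoscaler-v1 role has AmazonEC2FullAccess attached.
      IamInstanceProfile:
        Arn: arn:aws:iam::123456789012:instance-profile/ray-autoscaler-v1   # placeholder account ID

If you use a custom instance profile instead of the default one, make sure its role carries at least those EC2 permissions.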