I am trying to deploy the Ray demo on AWS from my Mac, as described here: Ray Cluster Quick Start — Ray 1.13.0
I ran ray up -y config.yaml to deploy the Ray cluster on AWS. I can SSH to the head node and see the Ray processes listening:
netstat -lntp
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:44217           0.0.0.0:*               LISTEN      3611/python
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::6379                 :::*                    LISTEN      3594/gcs_server
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::10001                :::*                    LISTEN      3618/python
Per the demo instructions, I changed script.py to use:
import ray
ray.init(address='auto')
However, when I run ray submit config.yaml script.py, I get an error:
2022-07-01 13:44:29,385 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:29,385 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xx.xx.xx.xx
2022-07-01 13:44:31,274 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:31,274 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Fetched IP: xx.xx.xx.xx
Traceback (most recent call last):
  File "/home/ubuntu/script.py", line 7, in <module>
    ray.init(address='auto')
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/worker.py", line 1278, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 459, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 341, in _find_gcs_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting --address flag or RAY_ADDRESS environment variable.
Shared connection to xx.xx.xx.xx closed.
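If I read that error correctly, it wants the address passed in explicitly instead of being discovered via 'auto'. As a sanity check on the head node itself, I assume something like the following should work (just a sketch; 6379 is the port gcs_server is listening on in the netstat output above):

```python
import ray

# Sketch only: point ray.init() at the GCS directly instead of address='auto'.
# Port 6379 is where gcs_server listens per the netstat output above;
# this assumes the script is executed on the head node itself.
ray.init(address="127.0.0.1:6379")
```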
Instead, to address the above error, I modified script.py to connect through the Ray Client:
ray.init(address='ray://xx.xx.xx.xx:10001')  # xx.xx.xx.xx is the head node IP
That change produces a different error:
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
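To figure out whether this timeout is a connectivity/security-group problem rather than a Ray problem, I am thinking of a quick port-reachability check from my Mac along these lines (only a sketch; xx.xx.xx.xx stands for the head node's public IP, and 10001 is the Ray Client port from the netstat output above):

```python
import socket

# Sketch only: test whether the head node's Ray Client port (10001) is
# reachable from my Mac. "xx.xx.xx.xx" is a placeholder for the public IP.
try:
    with socket.create_connection(("xx.xx.xx.xx", 10001), timeout=5):
        print("port 10001 is reachable")
except OSError as exc:
    print("port 10001 is NOT reachable:", exc)
```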
The original source for script.py:
from collections import Counter
import socket
import time

import ray

ray.init(address='auto')

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
Cluster config file:
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
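For completeness, this is the exact sequence of commands I run from my Mac with the two files shown above (both commands taken from the quick start):

```
ray up -y config.yaml
ray submit config.yaml script.py
```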
Any ideas on how to get this demo running on AWS?
Thanks
Jerry