I am trying to deploy the Ray demo on AWS from my Mac, as described here: Ray Cluster Quick Start — Ray 1.13.0
I ran ray up -y config.yaml to deploy the Ray cluster on AWS. I can SSH to the head node and see the Ray processes listening:
netstat -lntp
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:44217           0.0.0.0:*               LISTEN      3611/python
tcp        0      0 0.0.0.0:111             0.0.0.0:*               LISTEN      -
tcp6       0      0 :::22                   :::*                    LISTEN      -
tcp6       0      0 :::6379                 :::*                    LISTEN      3594/gcs_server
tcp6       0      0 :::111                  :::*                    LISTEN      -
tcp6       0      0 :::10001                :::*                    LISTEN      3618/python
Per the demo instructions, I changed script.py to use:
import ray
ray.init(address='auto')
However, when I run ray submit config.yaml script.py, I get an error:
2022-07-01 13:44:29,385 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:29,385 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xx.xx.xx.xx
2022-07-01 13:44:31,274 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:31,274 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Fetched IP: xx.xx.xx.xx
Traceback (most recent call last):
  File "/home/ubuntu/script.py", line 7, in <module>
    ray.init(address='auto')
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/worker.py", line 1278, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 459, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 341, in _find_gcs_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting --address flag or RAY_ADDRESS environment variable.
Shared connection to xx.xx.xx.xx closed.
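If I read that error correctly, it wants the address passed in explicitly instead of being discovered via 'auto'. As a sanity check on the head node itself, I assume something like the following should work (just a sketch; 6379 is the port gcs_server is listening on in the netstat output above):

```python
import ray

# Sketch only: point ray.init() at the GCS directly instead of address='auto'.
# Port 6379 is where gcs_server listens per the netstat output above;
# this assumes the script is executed on the head node itself.
ray.init(address="127.0.0.1:6379")
```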
Instead, to address the above error, I modified script.py to connect through the Ray Client:
ray.init(address='ray://xx.xx.xx.xx:10001')  # xx.xx.xx.xx is the head node IP
That change produces a different error:
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
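To figure out whether this timeout is a connectivity/security-group problem rather than a Ray problem, I am thinking of a quick port-reachability check from my Mac along these lines (only a sketch; xx.xx.xx.xx stands for the head node's public IP, and 10001 is the Ray Client port from the netstat output above):

```python
import socket

# Sketch only: test whether the head node's Ray Client port (10001) is
# reachable from my Mac. "xx.xx.xx.xx" is a placeholder for the public IP.
try:
    with socket.create_connection(("xx.xx.xx.xx", 10001), timeout=5):
        print("port 10001 is reachable")
except OSError as exc:
    print("port 10001 is NOT reachable:", exc)
```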
The original source for script.py:
from collections import Counter
import socket
import time

import ray

ray.init(address='auto')

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
Cluster config file:
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1
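For completeness, this is the exact sequence of commands I run from my Mac with the two files shown above (both commands taken from the quick start):

```
ray up -y config.yaml
ray submit config.yaml script.py
```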
Any ideas on how to get this demo running on AWS?
Thanks
Jerry