Ray quick start demo not working on AWS

I am trying to deploy the Ray demo on AWS from my Mac, as described here: Ray Cluster Quick Start — Ray 1.13.0

I ran ray up -y config.yaml to deploy the Ray cluster on AWS. I can SSH to the head node and see the Ray processes:

netstat -lntp

Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:44217 0.0.0.0:* LISTEN 3611/python
tcp 0 0 0.0.0.0:111 0.0.0.0:* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 :::6379 :::* LISTEN 3594/gcs_server
tcp6 0 0 :::111 :::* LISTEN -
tcp6 0 0 :::10001 :::* LISTEN 3618/python
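
For reference, here is a quick sanity check that can be run on the head node itself. It is only a sketch, assuming Ray 1.13's GCS-based bootstrapping, where the GCS listens on port 6379 (visible in the netstat output above):

import ray

# Run on the head node: connect straight to the GCS port instead of relying
# on "auto" discovery. If this succeeds, the cluster itself is up.
ray.init(address='127.0.0.1:6379')
print(ray.nodes())  # one entry per node currently in the cluster
ray.shutdown()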

Per the demo instructions, I've changed script.py to use
import ray
ray.init(address='auto')

However, when I run ray submit config.yaml script.py, I get an error:

2022-07-01 13:44:29,385 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:29,385 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Fetched IP: xx.xx.xx.xx
2022-07-01 13:44:31,274 INFO util.py:335 -- setting max workers for head node type to 0
2022-07-01 13:44:31,274 INFO util.py:339 -- setting max workers for ray.worker.default to 2
Fetched IP: xx.xx.xx.xx
Traceback (most recent call last):
  File "/home/ubuntu/script.py", line 7, in <module>
    ray.init(address='auto')
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/worker.py", line 1278, in init
    bootstrap_address = services.canonicalize_bootstrap_address(address)
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 459, in canonicalize_bootstrap_address
    addr = get_ray_address_from_environment()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 358, in get_ray_address_from_environment
    addr = _find_gcs_address_or_die()
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/_private/services.py", line 341, in _find_gcs_address_or_die
    "Could not find any running Ray instance. "
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting --address flag or RAY_ADDRESS environment variable.
Shared connection to xx.xx.xx.xx closed.
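
The error message suggests setting the --address flag or the RAY_ADDRESS environment variable. A minimal sketch of how script.py could honor that variable (my own workaround attempt, not part of the original demo):

import os
import ray

# Use RAY_ADDRESS if it is set, otherwise fall back to "auto" discovery.
ray.init(address=os.environ.get('RAY_ADDRESS', 'auto'))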

To address the above error, I've modified script.py to use
ray.init(address='ray://xx.xx.xx.xx:10001'), where xx.xx.xx.xx is the head node IP.

That change produces a different error:
  File "/home/ubuntu/anaconda3/envs/tensorflow2_latest_p37/lib/python3.7/site-packages/ray/util/client/worker.py", line 260, in _connect_channel
    raise ConnectionError("ray client connection timeout")
ConnectionError: ray client connection timeout
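
A Ray Client connection to ray://xx.xx.xx.xx:10001 requires port 10001 on the head node to be reachable from my laptop (for example, opened in the AWS security group). A small sketch to separate a networking problem from a Ray problem (the port_open helper is just my own illustration):

import socket

import ray

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

head_ip = 'xx.xx.xx.xx'  # head node IP
if port_open(head_ip, 10001):
    ray.init(address='ray://{}:10001'.format(head_ip))
    print(ray.cluster_resources())
else:
    print('Port 10001 is not reachable; check the security group / firewall.')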

The original source for script.py:

from collections import Counter
import socket
import time

import ray

ray.init(address='auto')

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
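
For completeness, one way I can rule out a problem in the script itself is to run the same kind of remote task against a local, single-node Ray instance first (ray.init() with no address starts one), before debugging cluster connectivity:

import ray

# Local sanity check: no cluster involved, just verifies that a remote task runs.
ray.init()

@ray.remote
def ping():
    return 'ok'

print(ray.get(ping.remote()))
ray.shutdown()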


The cluster config file (config.yaml):

# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-east-1

Any ideas on how to get this demo running on AWS?

Thanks
Jerry

We’ve noted the bug and will look into it early next week.

Let’s continue the discussion on GitHub. Feel free to add additional details there.