I’m trying to follow the instructions for starting a Ray cluster manually, but with no luck. I’d appreciate any help here! Due to our security regime I cannot use ray up (the autoscaler) or the Kubernetes approach.
Each node is a separate AWS EC2 instance whose security group allows incoming traffic on port 6379 and all outgoing traffic.
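To rule out the security group itself, I can verify raw TCP reachability of the head’s port 6379 from the worker with a quick Python snippet (a minimal check, independent of Ray; the IP below is just my head node’s private IP):

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

# e.g. run on the worker node, against the head's private IP:
# print(port_reachable("10.251.66.9", 6379))
```

If this prints False, the problem is network-level (security group, routing, or Docker networking) rather than Ray itself.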
Head Node:
(base) [root@1eb0c9fea2e0 beta]# ray start --head --port=6379
Local node IP: 172.17.0.2
2021-02-04 20:27:43,916 INFO services.py:1171 -- View the Ray dashboard at http://localhost:8265
2021-02-04 20:27:43,918 WARNING services.py:1632 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='172.17.0.2:6379' --redis-password='5241590000000000'
Alternatively, use the following Python code:
import ray
ray.init(address='auto', _redis_password='5241590000000000')
If connection fails, check your firewall settings and network configuration.
To terminate the Ray runtime, run
ray stop
Worker node:
(base) [root@e6be84eb8fae /]# ray start --address="10.251.66.9:6379" --redis-password='5241590000000000'
Local node IP: 172.17.0.2
2021-02-04 20:34:39,647 WARNING services.py:1632 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
(base) [root@e6be84eb8fae /]# ray timeline
Traceback (most recent call last):
File "/opt/conda/bin/ray", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1504, in main
return cli()
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1346, in timeline
address = services.get_ray_address_to_use_or_die()
File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 221, in get_ray_address_to_use_or_die
return find_redis_address_or_die()
File "/opt/conda/lib/python3.8/site-packages/ray/_private/services.py", line 233, in find_redis_address_or_die
raise ConnectionError(
ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting `address`.
The worker node claims to have connected to the head, but in reality it did not, as the ray timeline traceback above shows. (Note: in the worker’s ray start command I replaced the address the head printed with the head instance’s actual private IP, since the head only reports its local IP 172.17.0.2.)
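To confirm from the head’s side which nodes it actually sees, I was planning to inspect ray.nodes() after the worker reports “Ray runtime started” — a minimal sketch, where the helper alive_node_ips is just my own illustration, not part of Ray:

```python
# Hypothetical helper (not part of Ray): list the IPs of nodes the head
# considers alive, from the dictionaries that ray.nodes() returns.
def alive_node_ips(nodes):
    return [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]

# On the head node, after the worker claims to have started:
#   import ray
#   ray.init(address="auto", _redis_password="5241590000000000")
#   print(alive_node_ips(ray.nodes()))
```

If only the head’s own IP shows up in that list, the worker never actually registered.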
Any tips?
BR,
Ryan