TL;DR: I tried unsuccessfully to set up a minimal Ray cluster on the Google Cloud Platform (GCP). I have set up one head node and two additional nodes, but only the head node is recognized by the cluster.
This is my first time posting here, so I hope the formatting comes out fine.
I have been trying to set up a Ray cluster for hyper-parameter tuning on GCP, but I am having trouble using more than one node.
I have been following the tutorial I found here.
I am spawning three n1-standard-2 machines (1 head and 2 workers) with two cores each.
Right now when I execute the script proposed in the tutorial above and reported here:
from collections import Counter
import socket
import time

import ray

ray.init()

print('''This cluster consists of
{} nodes in total
{} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print(' {} tasks on {}'.format(num_tasks, ip_address))
the expected result, as stated in the tutorial, is:
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3561 tasks on XXX.XXX.X.XXX
2685 tasks on XXX.XXX.X.XXX
3754 tasks on XXX.XXX.X.XXX
but I get:
This cluster consists of
1 nodes in total
2.0 CPU resources in total
Tasks executed
10000 tasks on ip.address.head.node
suggesting that only one node (with two cores) is being used; presumably the head node.
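To look at this in more detail than the node count alone, the entries returned by ray.nodes() can be summarized one by one. This is only a minimal sketch: it assumes each entry is a dict with 'NodeManagerAddress', 'Alive' and 'Resources' keys, and the helper name summarize_nodes and the example entries are made up for illustration.

```python
def summarize_nodes(nodes):
    """Summarize node entries shaped like the dicts returned by ray.nodes().

    Assumes each entry has 'NodeManagerAddress', 'Alive' and 'Resources' keys.
    """
    lines = []
    for node in nodes:
        status = 'alive' if node.get('Alive') else 'dead'
        cpus = node.get('Resources', {}).get('CPU', 0)
        lines.append('{} ({}): {} CPU'.format(
            node.get('NodeManagerAddress', 'unknown'), status, cpus))
    return lines


# Hypothetical entries for illustration only; on the cluster, pass ray.nodes().
example_nodes = [
    {'NodeManagerAddress': '10.0.0.2', 'Alive': True, 'Resources': {'CPU': 2.0}},
    {'NodeManagerAddress': '10.0.0.3', 'Alive': False, 'Resources': {}},
]
for line in summarize_nodes(example_nodes):
    print(line)
```

On the head node this would be called as summarize_nodes(ray.nodes()) right after ray.init().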
In Compute Engine on GCP I can see that the correct number of VMs has been created.
When I run
ray down -y example_full.yaml
I get the message
Requested 3 nodes to shut down.
which sounds correct.
I am using the minimal example proposed in the tutorial with minor modifications. The full .yaml file is reported here:
cluster_name: minimal
provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-b
    project_id: humanitas-rad-ai-20-00
setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
head_setup_commands: []
worker_setup_commands: []
head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
worker_nodes:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
    scheduling:
      - preemptible: true
min_workers: 2
max_workers: 2
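With min_workers: 2 I would expect three nodes in total (the head plus two workers). One thing that could be ruled out is timing, in case the workers are still starting when the script runs. Below is a rough sketch of a polling helper, assuming the script is attached to the cluster started by ray up and that the ray.nodes() entries carry an 'Alive' key; the name wait_for_nodes and its parameters are hypothetical.

```python
import time


def wait_for_nodes(get_nodes, expected, timeout_s=300.0, poll_s=5.0):
    """Poll get_nodes() until `expected` nodes report Alive or the timeout expires.

    get_nodes is any callable returning dicts shaped like ray.nodes() entries.
    Returns the number of alive nodes seen on the last poll.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        alive = sum(1 for node in get_nodes() if node.get('Alive'))
        if alive >= expected or time.monotonic() >= deadline:
            return alive
        time.sleep(poll_s)
```

In the test script this would be called as wait_for_nodes(ray.nodes, expected=3) after ray.init() and before submitting the tasks. If it still returns 1 after the timeout, the problem is not that the workers were merely slow to join.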
Can anyone help me sort this out?
pierandrea