Troubles setting up a Ray Cluster on the Google Cloud Platform (GCP)

pierandrea-humanitas · March 2, 2021, 1:41pm

TL;DR: I tried unsuccessfully to set up a minimal Ray cluster on the Google Cloud Platform (GCP). I have set up one head node and two worker nodes, but only the head node is recognized by the cluster.

First time posting here, I hope the formatting comes out fine.

I have been trying to set up a Ray cluster for hyper-parameter tuning on GCP, but I am having trouble using more than one node.
I have been following the tutorial I found here.
I am spawning three n1-standard-2 machines (1 head and 2 workers) with two vCPUs each.
Right now, when I execute the script proposed in the tutorial above and reproduced here:

from collections import Counter
import socket
import time

import ray

ray.init()

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

the expected result, as stated in the tutorial, is:

This cluster consists of
    3 nodes in total
    6.0 CPU resources in total
Tasks executed
    3561 tasks on XXX.XXX.X.XXX
    2685 tasks on XXX.XXX.X.XXX
    3754 tasks on XXX.XXX.X.XXX

but I get:

This cluster consists of
    1 nodes in total
    2.0 CPU resources in total

Tasks executed
    10000 tasks on ip.address.head.node

suggesting that only one node (with two cores) is being used; presumably the head node.
In Compute Engine on GCP I can see that the correct number of VMs has been created.

When I run ray down -y example_full.yaml I get "Requested 3 nodes to shut down.", which sounds correct.
I am using the minimal example proposed in the tutorial with minor modifications. The full .yaml file is reported here:

cluster_name: minimal
provider:
    type: gcp
    region: europe-west1
    availability_zone: europe-west1-b
    project_id: humanitas-rad-ai-20-00
setup_commands: 
  - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
head_setup_commands: []
worker_setup_commands: []
head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu

worker_nodes:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
    scheduling:
      - preemptible: true

min_workers: 2
max_workers: 2

Can anyone help me sort this out?

pierandrea

rliaw · March 2, 2021, 11:14pm

I think the main thing here is that ray.init() should be ray.init(address="auto") in the script that you run on the cluster:

import ray

ray.init(address="auto")

print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))
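
Even with address="auto", the worker VMs may still be starting up when the script runs, so the node count can briefly read lower than expected. As a rough sketch (not part of the tutorial), the script could poll until the expected number of nodes has joined before submitting tasks. The helper name wait_for_nodes and its parameters are hypothetical; the only assumption about Ray is that ray.nodes() returns a list of dicts with an "Alive" key.

```python
import time


def wait_for_nodes(get_nodes, expected, timeout_s=120.0, poll_s=1.0):
    """Block until get_nodes() reports at least `expected` alive nodes.

    get_nodes is any zero-argument callable returning a list of dicts
    with an "Alive" key, which is the shape ray.nodes() returns.
    Returns the number of alive nodes, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        # Count only nodes the cluster currently considers alive.
        alive = sum(1 for node in get_nodes() if node.get("Alive"))
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                'only {} of {} nodes joined within {}s'.format(
                    alive, expected, timeout_s))
        time.sleep(poll_s)


# Intended use on the head node (requires Ray and a running cluster):
#   import ray
#   ray.init(address="auto")
#   wait_for_nodes(ray.nodes, expected=3)
```

Passing the node-listing function in as an argument keeps the helper independent of Ray, so it can be tried out with a stub before running it on the cluster.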

Could you help me by posting a GitHub issue on the Ray GitHub repository (and tagging richardliaw)? I’ll make sure this gets fixed.

Dmitri · March 3, 2021, 3:28am

Here’s a quick PR fixing the docs:
