I use the following YAML file to launch the cluster:
cluster_name: gp-amplab
max_workers: 2
provider:
    type: aws
    region: us-west-1
available_node_types:
    cpu-node:
        min_workers: 0
        max_workers: 0
        node_config:
            InstanceType: m5.4xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
        resources: {}
    1-gpu-node:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
            # InstanceMarketOptions:
            #     MarketType: spot
        resources: {}
head_node_type: cpu-node
setup_commands:
    - pip install "ray[default]" torch torchvision
I run the following script on the head node:
from collections import Counter
import socket
import time
import ray
from pprint import pprint

ray.init("auto")
pprint(ray.nodes())
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return the IP address of the node this task ran on.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)
print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
The output is:
[{'Alive': True,
  'MetricsExportPort': 50332,
  'NodeID': '491ed2f1b87da37b91de1c4e80c0e2e0d67f3a28eb379cb2b30717ab',
  'NodeManagerAddress': '10.0.1.113',
  'NodeManagerHostname': 'ip-10-0-1-113',
  'NodeManagerPort': 36095,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.113': 1.0,
                'object_store_memory': 4813522944.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 35012,
  'NodeID': 'fab52918c61484c57cb8e68d1216bb9017a7fa3b0850a635e925be2a',
  'NodeManagerAddress': '10.0.1.254',
  'NodeManagerHostname': 'ip-10-0-1-254',
  'NodeManagerPort': 45085,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.254': 1.0,
                'object_store_memory': 4812824985.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 51466,
  'NodeID': '67f6152ece32b941862676dfec101d016c8284aec5cae360a07774fa',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 36711,
  'ObjectManagerPort': 43399,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store.1',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet.1',
  'Resources': {'CPU': 16.0,
                'memory': 45530697319.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19513155993.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 54065,
  'NodeID': '8fa80d524ad48ece64906e651bf30f24b0ae46f76e8fbd130a991c83',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 40213,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 16.0,
                'memory': 39129027380.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19564513689.0},
  'alive': True}]
This cluster consists of
4 nodes in total
40.0 CPU resources in total
Tasks executed
6581 tasks on 10.0.1.48
1633 tasks on 10.0.1.254
1786 tasks on 10.0.1.113
It seems that the CPU head node is listed twice: the last two entries share the same address (10.0.1.48) but have different NodeIDs, and the 40 reported CPUs count the head node's 16 CPUs twice (4 + 4 + 16 + 16). Ideally, I would expect this to report 3 nodes in total.
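For what it's worth, deduplicating the entries by `NodeManagerAddress` confirms that only three distinct machines are present. A minimal, self-contained sketch (the sample list below just mirrors the structure of the `ray.nodes()` output above; in practice you would pass `ray.nodes()` directly):

```python
# Count distinct machines among node entries shaped like ray.nodes() output.
nodes = [
    {"Alive": True, "NodeManagerAddress": "10.0.1.113"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.254"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.48"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.48"},  # duplicate head-node entry
]

# Keep only alive entries and collapse duplicates by IP address.
unique_ips = {n["NodeManagerAddress"] for n in nodes if n["Alive"]}
print(f"{len(unique_ips)} distinct machines across {len(nodes)} entries")
# → 3 distinct machines across 4 entries
```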