Ray cluster node count issue

I use the following YAML file to launch the cluster:

cluster_name: gp-amplab

max_workers: 2

provider:
    type: aws
    region: us-west-1

available_node_types:
    cpu-node:
        min_workers: 0
        max_workers: 0
        node_config:
            InstanceType: m5.4xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
        resources: {}
    1-gpu-node:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
            # InstanceMarketOptions:
            #     MarketType: spot
        resources: {}

head_node_type: cpu-node

setup_commands:
    - pip install ray[default] torch torchvision

I run the following script on the head node:


from collections import Counter
import socket
import time

import ray
from pprint import pprint
ray.init(address="auto")  # connect to the already running cluster

pprint(ray.nodes())
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return IP address.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)

print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))

The printed output is:

[{'Alive': True,
  'MetricsExportPort': 50332,
  'NodeID': '491ed2f1b87da37b91de1c4e80c0e2e0d67f3a28eb379cb2b30717ab',
  'NodeManagerAddress': '10.0.1.113',
  'NodeManagerHostname': 'ip-10-0-1-113',
  'NodeManagerPort': 36095,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.113': 1.0,
                'object_store_memory': 4813522944.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 35012,
  'NodeID': 'fab52918c61484c57cb8e68d1216bb9017a7fa3b0850a635e925be2a',
  'NodeManagerAddress': '10.0.1.254',
  'NodeManagerHostname': 'ip-10-0-1-254',
  'NodeManagerPort': 45085,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.254': 1.0,
                'object_store_memory': 4812824985.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 51466,
  'NodeID': '67f6152ece32b941862676dfec101d016c8284aec5cae360a07774fa',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 36711,
  'ObjectManagerPort': 43399,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store.1',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet.1',
  'Resources': {'CPU': 16.0,
                'memory': 45530697319.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19513155993.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 54065,
  'NodeID': '8fa80d524ad48ece64906e651bf30f24b0ae46f76e8fbd130a991c83',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 40213,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 16.0,
                'memory': 39129027380.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19564513689.0},
  'alive': True}]
This cluster consists of
    4 nodes in total
    40.0 CPU resources in total

Tasks executed
    6581 tasks on 10.0.1.48
    1633 tasks on 10.0.1.254
    1786 tasks on 10.0.1.113

It seems that the CPU (head) node is listed twice: two of the four entries share the address 10.0.1.48, so the cluster reports 40 CPUs (16 + 16 + 4 + 4) instead of 24. I expected 3 nodes in total.
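
One way to confirm the duplication is to group the node records by address and list any address that has more than one alive entry. The following is a minimal sketch, not part of the original script: the helper name duplicate_node_addresses is hypothetical, and the sample records are abbreviated copies of the output above with NodeIDs truncated to their first 8 characters. Note that in the output above the two 10.0.1.48 entries also differ in their socket names (raylet vs. raylet.1).

```python
from collections import defaultdict


def duplicate_node_addresses(nodes):
    """Return {address: [NodeID, ...]} for addresses with more than one alive entry.

    `nodes` is a list of dicts shaped like the records printed by ray.nodes().
    (Hypothetical helper for illustration; not part of the Ray API.)
    """
    by_address = defaultdict(list)
    for node in nodes:
        # Skip records of nodes that are no longer alive.
        if node.get("Alive"):
            by_address[node["NodeManagerAddress"]].append(node["NodeID"])
    return {addr: ids for addr, ids in by_address.items() if len(ids) > 1}


# Abbreviated records from the output above (NodeIDs truncated to 8 characters).
sample_nodes = [
    {"Alive": True, "NodeID": "491ed2f1", "NodeManagerAddress": "10.0.1.113"},
    {"Alive": True, "NodeID": "fab52918", "NodeManagerAddress": "10.0.1.254"},
    {"Alive": True, "NodeID": "67f6152e", "NodeManagerAddress": "10.0.1.48"},
    {"Alive": True, "NodeID": "8fa80d52", "NodeManagerAddress": "10.0.1.48"},
]

duplicates = duplicate_node_addresses(sample_nodes)
print(duplicates)
```

On a live cluster, the same check can be run by passing ray.nodes() instead of sample_nodes.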

Weird!
What do you get when you run ray status on the head node?

Would you mind posting the same info in a bug report on the Ray GitHub and tagging me (@DmitriGekhtman)?

Thanks @Dmitri, closed!

@Jimmy Hi! I am having issues setting up a cluster. Could you please pair with me so I can set up a local cluster? I am working on clinical research and have a huge dataset. Could you please help me?

@sohail_4233
Feel free to post about issues you’re encountering on this thread.

@Dmitri Sure, thanks. Will do that.