I use the following YAML file to launch the cluster:
cluster_name: gp-amplab
max_workers: 2
provider:
    type: aws
    region: us-west-1
available_node_types:
    cpu-node:
        min_workers: 0
        max_workers: 0
        node_config:
            InstanceType: m5.4xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
        resources: {}
    1-gpu-node:
        min_workers: 2
        max_workers: 2
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: ami-02d4cdd49d3036d46
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 1000
            # InstanceMarketOptions:
            #     MarketType: spot
        resources: {}
head_node_type: cpu-node
setup_commands:
    - pip install "ray[default]" torch torchvision
I run the following script on the head node:
from collections import Counter
import socket
import time
import ray
from pprint import pprint

ray.init("auto")
pprint(ray.nodes())
print('''This cluster consists of
    {} nodes in total
    {} CPU resources in total
'''.format(len(ray.nodes()), ray.cluster_resources()['CPU']))

@ray.remote
def f():
    time.sleep(0.001)
    # Return the IP address of the node this task ran on.
    return socket.gethostbyname(socket.gethostname())

object_ids = [f.remote() for _ in range(10000)]
ip_addresses = ray.get(object_ids)
print('Tasks executed')
for ip_address, num_tasks in Counter(ip_addresses).items():
    print('    {} tasks on {}'.format(num_tasks, ip_address))
The output is:
[{'Alive': True,
  'MetricsExportPort': 50332,
  'NodeID': '491ed2f1b87da37b91de1c4e80c0e2e0d67f3a28eb379cb2b30717ab',
  'NodeManagerAddress': '10.0.1.113',
  'NodeManagerHostname': 'ip-10-0-1-113',
  'NodeManagerPort': 36095,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.113': 1.0,
                'object_store_memory': 4813522944.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 35012,
  'NodeID': 'fab52918c61484c57cb8e68d1216bb9017a7fa3b0850a635e925be2a',
  'NodeManagerAddress': '10.0.1.254',
  'NodeManagerHostname': 'ip-10-0-1-254',
  'NodeManagerPort': 45085,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 4.0,
                'GPU': 1.0,
                'accelerator_type:T4': 1.0,
                'memory': 12025908428.0,
                'node:10.0.1.254': 1.0,
                'object_store_memory': 4812824985.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 51466,
  'NodeID': '67f6152ece32b941862676dfec101d016c8284aec5cae360a07774fa',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 36711,
  'ObjectManagerPort': 43399,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store.1',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet.1',
  'Resources': {'CPU': 16.0,
                'memory': 45530697319.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19513155993.0},
  'alive': True},
 {'Alive': True,
  'MetricsExportPort': 54065,
  'NodeID': '8fa80d524ad48ece64906e651bf30f24b0ae46f76e8fbd130a991c83',
  'NodeManagerAddress': '10.0.1.48',
  'NodeManagerHostname': 'ip-10-0-1-48',
  'NodeManagerPort': 40213,
  'ObjectManagerPort': 8076,
  'ObjectStoreSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/plasma_store',
  'RayletSocketName': '/tmp/ray/session_2022-05-02_14-49-56_372625_26032/sockets/raylet',
  'Resources': {'CPU': 16.0,
                'memory': 39129027380.0,
                'node:10.0.1.48': 1.0,
                'object_store_memory': 19564513689.0},
  'alive': True}]
This cluster consists of
4 nodes in total
40.0 CPU resources in total
Tasks executed
6581 tasks on 10.0.1.48
1633 tasks on 10.0.1.254
1786 tasks on 10.0.1.113
It seems that the CPU head node is listed twice: the last two entries share the same address (10.0.1.48) but have different NodeIDs, and the 40 reported CPUs count the head node's 16 CPUs twice (4 + 4 + 16 + 16). Ideally, I would expect this to report 3 nodes in total.
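For what it's worth, deduplicating the entries by `NodeManagerAddress` confirms that only three distinct machines are present. A minimal, self-contained sketch (the sample list below just mirrors the structure of the `ray.nodes()` output above; in practice you would pass `ray.nodes()` directly):

```python
# Count distinct machines among node entries shaped like ray.nodes() output.
nodes = [
    {"Alive": True, "NodeManagerAddress": "10.0.1.113"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.254"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.48"},
    {"Alive": True, "NodeManagerAddress": "10.0.1.48"},  # duplicate head-node entry
]

# Keep only alive entries and collapse duplicates by IP address.
unique_ips = {n["NodeManagerAddress"] for n in nodes if n["Alive"]}
print(f"{len(unique_ips)} distinct machines across {len(nodes)} entries")
# → 3 distinct machines across 4 entries
```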