Ray cluster doesn't work, even connected well

Tae_Hyun_Jo · May 27, 2022, 3:44pm

I’m trying to cluster two local machines.

<Setup: Success>
head: 192.168.0.21
node: 192.168.0.22

On head machine (@Python)
ray.init(address=‘auto’)

On node machine(@ terminal)
ray start --address=‘192.168.0.21:6379’

Again, on head machine (@Python)

“ray.nodes()” prints

[
{‘NodeID’: ‘f22c89a7bfb82521a5cd58104da454063f9d731ec507f3af6f7410e4’, ‘Alive’: True, ‘NodeManagerAddress’: ‘192.168.0.22’, ‘NodeManagerHostname’: ‘cp22’, ‘NodeManagerPort’: 35225, ‘ObjectManagerPort’: 44411, ‘ObjectStoreSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/plasma_store’, ‘RayletSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/raylet’, ‘MetricsExportPort’: 61799, ‘alive’: True, ‘Resources’: {‘CPU’: 4.0, ‘object_store_memory’: 858294681.0, ‘node:192.168.0.22’: 1.0, ‘memory’: 2002687591.0}},
{‘NodeID’: ‘a3807ff7292d67422155743a6bc78046cc80b5fe0bc0999d370e7308’, ‘Alive’: True, ‘NodeManagerAddress’: ‘192.168.0.21’, ‘NodeManagerHostname’: ‘cp21’, ‘NodeManagerPort’: 43211, ‘ObjectManagerPort’: 39253, ‘ObjectStoreSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/plasma_store’, ‘RayletSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/raylet’, ‘MetricsExportPort’: 44215, ‘alive’: True, ‘Resources’: {‘object_store_memory’: 940081152.0, ‘node:192.168.0.21’: 1.0, ‘CPU’: 4.0, ‘memory’: 1880162304.0}}
]

I can check two nodes are available and also
“ray.cluster_resources()” prints

{‘object_store_memory’: 1798375833.0, ‘CPU’: 8.0, ‘memory’: 3882849895.0, ‘node:192.168.0.22’: 1.0, ‘node:192.168.0.21’: 1.0}

so It seems that setup is well done.

But, when I run the codes for checking cluster runs well which is on documentation of ray.

import time

@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

prints “{‘192.168.0.21’}”

I got the results that only head(192.168.0.21) run, except node(192.168.0.22).
Is there something that I’m missing?

yic · May 31, 2022, 8:08pm

@Tae_Hyun_Jo Ray will reuse the worker, so for your case, all f.remote() is running in the same node.

The scheduling policy is not optimized for balancing the loads, but for performance.

I saw you have 4 cpus per node. So one think you can try is to:

start two actors each with 4 cpus so that it will be forced to be distributed to two nodes
call the remote function in actor to get the node ip.

Topic		Replies	Views
Ray cluster didn't use all the available CPU nodes Ray Clusters	1	582	February 16, 2024
Ray cluster number issue Ray Clusters	6	434	June 6, 2022
Ray cluster is not found at node Ray Clusters	0	174	January 11, 2024
Fully client mode access to a Ray cluster seems not to be possible Ray Clusters	1	511	July 18, 2022
Help with starting a local ray cluster? Ray Clusters	2	500	July 28, 2024

Ray cluster doesn't work, even connected well

Related topics