Ray cluster doesn't work, even connected well

I’m trying to cluster two local machines.

<Setup: Success>
head: 192.168.0.21
node: 192.168.0.22

On head machine (@Python)
ray.init(address=‘auto’)

On node machine(@ terminal)
ray start --address=‘192.168.0.21:6379’

Again, on head machine (@Python)

“ray.nodes()” prints

[
{‘NodeID’: ‘f22c89a7bfb82521a5cd58104da454063f9d731ec507f3af6f7410e4’, ‘Alive’: True, ‘NodeManagerAddress’: ‘192.168.0.22’, ‘NodeManagerHostname’: ‘cp22’, ‘NodeManagerPort’: 35225, ‘ObjectManagerPort’: 44411, ‘ObjectStoreSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/plasma_store’, ‘RayletSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/raylet’, ‘MetricsExportPort’: 61799, ‘alive’: True, ‘Resources’: {‘CPU’: 4.0, ‘object_store_memory’: 858294681.0, ‘node:192.168.0.22’: 1.0, ‘memory’: 2002687591.0}},
{‘NodeID’: ‘a3807ff7292d67422155743a6bc78046cc80b5fe0bc0999d370e7308’, ‘Alive’: True, ‘NodeManagerAddress’: ‘192.168.0.21’, ‘NodeManagerHostname’: ‘cp21’, ‘NodeManagerPort’: 43211, ‘ObjectManagerPort’: 39253, ‘ObjectStoreSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/plasma_store’, ‘RayletSocketName’: ‘/tmp/ray/session_2022-05-28_00-29-38_080158_24534/sockets/raylet’, ‘MetricsExportPort’: 44215, ‘alive’: True, ‘Resources’: {‘object_store_memory’: 940081152.0, ‘node:192.168.0.21’: 1.0, ‘CPU’: 4.0, ‘memory’: 1880162304.0}}
]

I can check two nodes are available and also
“ray.cluster_resources()” prints

{‘object_store_memory’: 1798375833.0, ‘CPU’: 8.0, ‘memory’: 3882849895.0, ‘node:192.168.0.22’: 1.0, ‘node:192.168.0.21’: 1.0}

so It seems that setup is well done.

But, when I run the codes for checking cluster runs well which is on documentation of ray.

import time

@ray.remote
def f():
time.sleep(0.01)
return ray._private.services.get_node_ip_address()

set(ray.get([f.remote() for _ in range(1000)]))

prints “{‘192.168.0.21’}”

I got the results that only head(192.168.0.21) run, except node(192.168.0.22).
Is there something that I’m missing?

@Tae_Hyun_Jo Ray will reuse the worker, so for your case, all f.remote() is running in the same node.

The scheduling policy is not optimized for balancing the loads, but for performance.

I saw you have 4 cpus per node. So one think you can try is to:

  1. start two actors each with 4 cpus so that it will be forced to be distributed to two nodes
  2. call the remote function in actor to get the node ip.