To use as many resources as possible, I tried the following combinations. I didn't run any code yet, but the network transmit (sent) traffic on the Ray head node is already very large.
(1) each worker pod has 4 cpus (1119 cpus in total)
The head node contains a process called gcs_server, which interacts heavily with the rest of the system. One example is heartbeat data: the head node receives heartbeats from all nodes and sends the combined result back to every other node, which amounts to an O(N^2) payload (it receives heartbeats from N nodes, then sends the combined batch of N heartbeats back to each of the N nodes).
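To make that quadratic growth concrete, here is a rough back-of-the-envelope sketch. The per-node heartbeat size and broadcast interval below are assumed values for illustration, not measured Ray numbers:

```python
# Rough sketch of the head node's heartbeat egress as the cluster grows.
# HEARTBEAT_BYTES_PER_NODE and BROADCAST_INTERVAL_S are assumptions, not
# measured Ray values.

HEARTBEAT_BYTES_PER_NODE = 300   # assumed size of one node's heartbeat entry
BROADCAST_INTERVAL_S = 0.1       # assumed batching/broadcast period

def head_egress_bytes_per_sec(num_nodes: int) -> float:
    """O(N^2): the head combines N heartbeats into one batch and then
    sends that batch back to each of the N nodes."""
    batch_bytes = num_nodes * HEARTBEAT_BYTES_PER_NODE
    return num_nodes * batch_bytes / BROADCAST_INTERVAL_S

for n in (100, 280, 1000):
    print(f"{n:>5} nodes -> ~{head_egress_bytes_per_sec(n) / 1e6:,.0f} MB/s")
```

Under these assumed numbers, ~280 small nodes (roughly the 4-cpu-pod setup above) already produces hundreds of MB/s of heartbeat egress, and 1000 nodes would be several GB/s, which is why fewer, bigger nodes scale better.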
Hey @crystalww, the size of heartbeats (and thus the network/cpu usage on the head node) grows roughly quadratically w.r.t. the number of nodes in the cluster. It looks like you’ve already discovered this, but the cluster will scale better if you use bigger nodes (people tend to use 32/64+ cpus per node on big clusters).
If you’re running big clusters, it may also help to give your head node a few extra cpus (or just set num_cpus=0 on the head node so no tasks get scheduled there). The head node runs extra processes like the GCS, so letting it focus its resources on control-plane operations tends to help.
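As a minimal sketch of the num_cpus=0 suggestion: in a real multi-node deployment you would usually pass --num-cpus=0 to the head node's `ray start --head` command (or the equivalent field in your cluster config); the Python snippet below just illustrates the effect on scheduling.

```python
import ray

# Start the head with zero advertised CPUs so Ray won't schedule CPU tasks on it.
# (In a multi-node cluster you'd do this via `ray start --head --num-cpus=0`.)
ray.init(num_cpus=0)

@ray.remote(num_cpus=1)
def work():
    return "ran on a worker node"

# This task stays pending until a worker node with CPUs joins the cluster,
# leaving the head node free for GCS / control-plane work.
ref = work.remote()
```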
I am considering scaling my cluster up to 1000 machines, each with 36 cpus, but I’m afraid my head node’s bandwidth will be saturated by the quadratically growing heartbeat packets, resulting in a congested network. Is there a good way to deal with this?