[ray][k8s] Large network transmit in ray head node

Environment:

  • ray 1.0.0, python 3.6.9
  • k8s cluster

In order to use as many resources as possible, I tried the following combinations. I haven't run any code yet, but the network transmit (sent) on the Ray head node is very large.

(1) each worker pod has 4 cpus (1119 cpus in total)


head node: 15 CPUs, 16Gi Memory
worker node: 4 CPUs, 2Gi Memory, 276 replicas

(2) each worker pod has 6 cpus (1119 cpus in total)


head node: 15 CPUs, 16Gi Memory
worker node: 6 CPUs, 4Gi Memory, 184 replicas

(3) each worker pod has 8 cpus (959 cpus in total)


head node: 15 CPUs, 16Gi Memory
worker node: 8 CPUs, 6Gi Memory, 118 replicas

(4) each worker pod has 15 cpus (1200 cpus in total)


head node: 15 CPUs, 16Gi Memory
worker node: 15 CPUs, 8Gi Memory, 79 replicas

So, I want to know:

  1. Why does the head node send so many packets per second? I haven't run any code yet.
  2. If I use NFS to share code, how can I make nodes read the code locally instead of having it distributed by the GCS?

The head node contains a process called gcs_server, which interacts heavily with other parts of the system. One example is heartbeat data: the head node receives heartbeats from all nodes and sends the combined result back to every other node, which can amount to an O(N^2) payload (it receives heartbeats from N nodes and then sends the combined N heartbeats back to N nodes).

The received heartbeats are used by each node to run Ray's decentralized scheduler.
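To get a feel for the growth, here's a rough back-of-the-envelope sketch. The per-entry size and interval below are made-up placeholders, not Ray's actual values; only the quadratic shape is the point:

```python
# Back-of-the-envelope sketch of head-node heartbeat egress.
# Both constants are placeholder assumptions (not Ray's real values);
# only the O(N^2) growth pattern matters here.
ENTRY_BYTES = 200   # assumed size of a single node's heartbeat entry
INTERVAL_S = 0.1    # assumed reporting interval

def head_egress_mb_per_sec(num_nodes: int) -> float:
    # The head receives N entries, then broadcasts the combined batch of
    # N entries back to each of the N nodes -> ~N * N entries sent out.
    return num_nodes * num_nodes * ENTRY_BYTES / INTERVAL_S / 1e6

for n in (79, 118, 184, 276):   # worker counts from the configs above
    print(f"{n:>3} nodes -> ~{head_egress_mb_per_sec(n):,.0f} MB/s (illustrative only)")
```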

Thanks for the reply. If I want to utilize more physical resources, should I create multiple Ray clusters instead of one?

Hey @crystalww, the size of heartbeats (and thus network/CPU usage on the head node) grows roughly quadratically w.r.t. the number of nodes in the cluster. It looks like you've already discovered this, but the cluster will scale better if you use bigger nodes (people tend to use 32/64+ CPUs per node on big clusters).
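For example, here's a quick sketch of how the node count (and hence the N^2 heartbeat term) shrinks when you keep roughly the same CPU budget but use bigger pods. The 1104 figure is just the worker-CPU total taken from your first config:

```python
# Same rough CPU budget, different pod sizes: the node count N (and
# hence the N^2 heartbeat term) drops sharply with bigger pods.
TOTAL_WORKER_CPUS = 1104  # worker CPUs from config (1) above

for cpus_per_pod in (4, 8, 16, 32, 64):
    n = TOTAL_WORKER_CPUS // cpus_per_pod
    print(f"{cpus_per_pod:>2} cpus/pod -> {n:>3} pods -> N^2 = {n * n:,}")
```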

If you're running big clusters, it may also help to give your head node a few extra CPUs (or just set num_cpus=0 on the head node). The head node runs extra processes like the GCS, so reserving some resources on it for control-plane operations tends to help too.
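As a minimal sketch, once the cluster is up you could sanity-check what the scheduler sees per node. This assumes the head was started with `ray start --head --num-cpus=0`; the snippet itself only reads cluster state:

```python
import ray

# Attach to the running cluster from any pod in it.
ray.init(address="auto")

# ray.nodes() reports the resources each node advertises to the scheduler.
# With num_cpus=0 on the head, its CPU entry should be 0 (or absent),
# so no tasks get placed there.
for node in ray.nodes():
    if node["Alive"]:
        cpus = node.get("Resources", {}).get("CPU", 0)
        print(node["NodeManagerAddress"], "CPU:", cpus)
```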


I am considering scaling my cluster up to 1000 machines, each with 36 CPUs, but I'm afraid my head node's bandwidth will be saturated by the quadratically growing heartbeat packets, resulting in a congested network. Is there a good way to deal with this?

Hey, you may want to check out the deployment guide: Ray Deployment Guide — Ray v1.9.1

Well, I've read that, and it seems the most practical way to solve it is to increase my head node's bandwidth. I'll try something else anyway, thx.

Do you still have issues with recent Ray versions? We've made some improvements in this area since 1.0 and plan to improve it further.