Examples not running across multiple nodes in a cluster

mentics · June 22, 2023, 7:43pm

I have a local microk8s cluster running KubeRay with 1 head node + 3 worker nodes across 3 physical machines, and I’m running the examples (on CPU, no GPU involved):

I’ve modified them slightly to add ray.init(…) to point to my cluster.
They seem to run fine, except that as I watch the cluster page, it appears they’re running on only one node. Each separate time I run it, it might run on a different node, but a single run is not being distributed. Two separate runs will use two nodes, so it appears the cluster and nodes are healthy and working.

Is there some configuration or code I need to put in place to make it distribute the load across multiple nodes?

kai · June 23, 2023, 2:20pm

@mentics it depends a bit on the resources the workers need and how many of those resources are available on your nodes.

For instance, if each of your nodes has 8 CPUs, and you only start 2 workers (like in the example), they can be scheduled on the same node, and Ray usually prefers that to minimize communication overhead.
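To make the packing behaviour concrete, here is a toy model in plain Python. It is not Ray's scheduler, only a greedy "fill the first node that has room" sketch, and the function name `pack_workers` and the assumption of 1 CPU per worker are made up for illustration.

```python
def pack_workers(node_cpus, num_workers, cpus_per_worker=1):
    """Toy 'pack' placement: put each worker on the first node with enough free CPUs.

    Returns a list with the node index chosen for each worker.
    This is an illustration only, not how Ray actually schedules.
    """
    free = list(node_cpus)  # copy so the caller's list is not modified
    placement = []
    for _ in range(num_workers):
        for node, cpus in enumerate(free):
            if cpus >= cpus_per_worker:
                free[node] -= cpus_per_worker
                placement.append(node)
                break
        else:
            # No node had room for this worker.
            raise ValueError("not enough free CPUs for all workers")
    return placement


# Three nodes with 8 CPUs each, 2 workers needing 1 CPU each (assumed values).
print(pack_workers([8, 8, 8], num_workers=2))  # [0, 0]: both workers share node 0
```

Under these assumptions both workers fit on the first node, which matches seeing only one node busy during a single run.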

If you want to force them onto different nodes, you can specify placement_strategy="STRICT_SPREAD" in your ScalingConfig. This should distribute the workers.
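A minimal sketch of what that could look like. The worker count and `use_gpu=False` are assumptions taken from the description above, the helper name `make_scaling_config` is hypothetical, and the import path depends on your Ray version (older releases expose `ScalingConfig` under `ray.air.config`).

```python
# Keyword arguments for ScalingConfig; the values are illustrative assumptions.
spread_kwargs = {
    "num_workers": 2,  # should not exceed the number of nodes when using STRICT_SPREAD
    "use_gpu": False,  # CPU-only cluster
    "placement_strategy": "STRICT_SPREAD",  # ask for each worker on a different node
}


def make_scaling_config(kwargs):
    """Build a ScalingConfig from a dict of keyword arguments (hypothetical helper)."""
    # Imported here so the dict above can be inspected without Ray installed.
    # On older Ray releases the import is: from ray.air.config import ScalingConfig
    from ray.train import ScalingConfig

    return ScalingConfig(**kwargs)
```

You would then pass `make_scaling_config(spread_kwargs)` as the `scaling_config` argument of the trainer used in the example. Note that with a strict spread the run cannot be placed if there are fewer nodes with free resources than workers.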

Another way to achieve the same goal is to allocate more CPUs to each worker (resources_per_worker in the ScalingConfig), so that only 1 worker fits on each node. Yet another way is to increase the number of workers. Obviously, this all depends mostly on what you’re trying to achieve.
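As a rough sketch of the resources_per_worker route: a worker that requests more than half of a node's CPUs leaves no room for a second worker on that node. The 8-CPU node size comes from the example above; the helper names are hypothetical, and the CPUs Ray actually sees per node may be lower than the machine total (for instance because of pod limits), so treat the numbers as assumptions.

```python
def min_cpus_for_one_worker_per_node(node_cpus: int) -> int:
    """Smallest whole-CPU request such that two workers cannot share one node."""
    return node_cpus // 2 + 1


cpus = min_cpus_for_one_worker_per_node(8)  # 5 for an 8-CPU node

# Keyword arguments for ScalingConfig; the values are illustrative assumptions.
big_worker_kwargs = {
    "num_workers": 2,
    "use_gpu": False,
    "resources_per_worker": {"CPU": cpus},
}


def make_scaling_config(kwargs):
    """Build a ScalingConfig from a dict of keyword arguments (hypothetical helper)."""
    # On older Ray releases the import is: from ray.air.config import ScalingConfig
    from ray.train import ScalingConfig

    return ScalingConfig(**kwargs)
```

Compared with a strict spread, this keeps the default placement strategy and relies on the resource request alone to keep workers apart.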

mentics · June 23, 2023, 4:10pm

Thanks so much for the detailed response! placement_strategy="STRICT_SPREAD" is exactly what I was looking for. Right now, I’m just trying things out to make sure everything works on the cluster, so that when I do try my own code, any bugs are more likely to be in my code than in the cluster config.

I added that to both examples, but now they both fail in different ways. I started this new question to cover that.
