Examples not running across multiple nodes in a cluster

I have a local microk8s cluster running KubeRay with 1 head node + 3 worker nodes across 3 physical machines, and I'm running these examples (CPU only, no GPU involved):
https://docs.ray.io/en/latest/train/examples/tf/tensorflow_mnist_example.html
https://docs.ray.io/en/latest/train/train.html
I’ve modified them slightly to add ray.init(…) so they point to my cluster.
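For context, the connection code looks roughly like this (the address is just a placeholder for my head node's Ray Client endpoint):

```python
import ray

# Connect to the KubeRay cluster through the Ray Client port exposed by the head node.
# "<head-node-address>" is a placeholder; substitute the actual head service IP/hostname.
ray.init("ray://<head-node-address>:10001")
```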
They seem to run fine, except that, watching the cluster page, they appear to run on only one node. Each time I run an example it might land on a different node, but a single run is never distributed across nodes. Two separate runs will use two different nodes, so the cluster and nodes themselves appear healthy and working.

Is there some configuration or code I need to put in place to make it distribute the load across multiple nodes?

@mentics it depends a bit on the resources the workers need and how many are available on each of your nodes.

For instance, if each of your nodes has 8 CPUs and you only start 2 workers (as in the example), both workers can be scheduled on the same node, and Ray usually prefers that in order to minimize communication overhead.

If you want to force them onto different nodes, you can specify placement_strategy="STRICT_SPREAD" in your ScalingConfig. This will spread the workers across separate nodes.
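A minimal sketch of what that could look like for the TensorFlow MNIST example (assuming a recent Ray version where ScalingConfig lives under ray.train; on older versions it may be under ray.air, and train_func stands in for the train loop defined in the example):

```python
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

# STRICT_SPREAD requires each worker to be placed on a different node
# (scheduling fails if that isn't possible).
scaling_config = ScalingConfig(
    num_workers=3,
    use_gpu=False,
    placement_strategy="STRICT_SPREAD",
)

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,  # train_func as defined in the linked example
    scaling_config=scaling_config,
)
result = trainer.fit()
```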

Another way to achieve the same goal is to allocate more CPUs to each worker (resources_per_worker in the ScalingConfig), so that only one worker fits onto each node; see the sketch below. Yet another way is to increase the number of workers. Obviously, this all depends mostly on what you’re trying to achieve.
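For the resources_per_worker approach, something like this (the CPU counts are assumptions; adjust them to your machines):

```python
from ray.train import ScalingConfig

# Reserving most of a node's CPUs for each worker means only one worker fits per node,
# so 3 workers end up on 3 different nodes even with the default "PACK" strategy.
scaling_config = ScalingConfig(
    num_workers=3,
    use_gpu=False,
    resources_per_worker={"CPU": 6},  # assumes each node has roughly 8 CPUs
)
```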


Thanks so much for the detailed response! placement_strategy="STRICT_SPREAD" is exactly what I was looking for. Right now I’m just trying things out to make sure the cluster is working, so that when I run my own code, any bugs are more likely to be in my code rather than in the cluster config.

I added that to both examples, but now they both fail in different ways. I started a new question to cover that.