Train examples not running, or showing NaN, after setting placement_strategy="STRICT_SPREAD"

I have a local microk8s cluster running KubeRay with 1 head node + 3 worker nodes across 3 physical machines, and I am running these examples (on CPU, no GPU involved):
https://docs.ray.io/en/latest/train/train.html
https://docs.ray.io/en/latest/train/examples/tf/tensorflow_mnist_example.html

They both appeared to run fine and produced reasonable outputs and results. Based on the response to my previous question, I added placement_strategy="STRICT_SPREAD" to the ScalingConfig (roughly as in the sketch below), and now both fail in different ways.
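
A minimal sketch of that change (train_func stands in for the training function from the example; the num_workers and use_gpu values just reflect the CPU-only, 3-worker setup described above):

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,  # training function from the example
    scaling_config=ScalingConfig(
        num_workers=3,
        use_gpu=False,
        # the only change: place each worker on a different node
        placement_strategy="STRICT_SPREAD",
    ),
)
result = trainer.fit()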

  1. The quickstart (from the train.html page) now shows activity on all the nodes, but all the metrics are nan:
...
350/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
356/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
360/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
...
  2. The MNIST example doesn’t train. It just keeps printing its status every couple of seconds, though it does show RUNNING:
| TensorflowTrainer_9cde3_00000 | RUNNING  | 10.1.215.18:1266 |

The cluster dashboard page does show a command line for every node (e.g. ray::TunerInternal.fit, ray::_RayTrainWorker__execute.get_next, ray::AutoscalingRequester, ray::TrainTrainable.train). I looked at the logs for them but didn’t notice anything obvious indicating why it’s doing nothing.

This behavior indicates that communication between the different workers is faulty or blocked. Usually this means that your Kubernetes nodes can’t communicate with each other on all ports.

The way Ray Train’s TensorflowTrainer (and ultimately all distributed TensorFlow training) works is that every worker opens a port on its respective node. This port (plus the node IP) is then used during distributed training to communicate gradients and model updates.
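
One way to see the exact IPs and ports the workers use is to print TF_CONFIG at the start of your training function (a sketch; TF_CONFIG is the standard mechanism distributed TensorFlow relies on, and Ray Train populates it on each worker):

import json
import os

def train_func(config):
    # The "cluster" section lists every worker's node IP and the port it
    # opened for gradient/model-update traffic; "task" identifies this worker.
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    print("cluster:", tf_config.get("cluster"))
    print("task:", tf_config.get("task"))
    # ... rest of the training code from the example ...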

Can you relax your Kubernetes configuration to allow full communication between the pods?

Thank you. I found this: “By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace”. I have not created any network policies. I see two HDFS-related network policies, but that’s it. So wouldn’t that mean full communication between the pods is already allowed? Just to make sure, I created a network policy allowing all ingress/egress on all pods and tried again, but the behavior for both examples is the same.

Can you check whether network connections work manually?

E.g. like this:

# On pod 1
nc -k -l 1234
# On pod 2
echo hello | nc pod1 1234
# This should output "hello" in the pod1 process

And please try this both with the k8s DNS name and with the pod IP.

Ray Train sets up the DDP communication via IP addresses; I’m wondering whether DNS is required instead.

Thank you for providing such detailed instructions. Using IP addresses, I confirmed that this works between the worker pods and from the workers to the head pod.

Attempting to use pod names doesn’t work:

$ echo hello | nc raycluster-autoscaler-head-gdsnd 1234
nc: getaddrinfo for host "raycluster-autoscaler-head-gdsnd" port 1234: Name or service not known

However, that appears to be by design: pod names aren’t resolvable directly in DNS. The Kubernetes DNS docs say that a name of the form 10-1-215-52.default.pod.cluster.local should resolve, and that does work:

$ echo hello | nc 10-1-215-52.default.pod.cluster.local 1234
[hello displayed in listening terminal]
$ echo hello | nc 10-1-215-52.default.pod 1234
[hello displayed in listening terminal]

I also tested DNS resolution for services from inside the pod, and that worked.
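
For completeness, the same resolution checks could be scripted from Python inside a pod (a sketch; the names are the ones used above, plus kubernetes.default.svc.cluster.local as an example service name; the bare pod name is expected to fail since pod names aren’t in DNS):

import socket

names = [
    "10.1.215.52",                            # plain pod IP
    "10-1-215-52.default.pod.cluster.local",  # pod DNS name
    "raycluster-autoscaler-head-gdsnd",       # bare pod name (expected to fail)
    "kubernetes.default.svc.cluster.local",   # an example service name
]

for name in names:
    try:
        infos = socket.getaddrinfo(name, 1234, proto=socket.IPPROTO_TCP)
        print(name, "->", sorted({info[4][0] for info in infos}))
    except socket.gaierror as err:
        print(name, "-> resolution failed:", err)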