They both appeared to run fine and produced reasonable outputs and results. Based on the response to my previous question, I added placement_strategy="STRICT_SPREAD" to the ScalingConfig, and now both fail in different ways.
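For reference, this is roughly where the setting went. It's a minimal sketch assuming a recent Ray release where ScalingConfig is importable from ray.train; the worker count and the train_func body are placeholders, not my actual values:

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer

def train_func():
    # Per-worker training loop from the quickstart goes here (placeholder).
    ...

trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(
        num_workers=3,                       # placeholder worker count
        use_gpu=False,
        placement_strategy="STRICT_SPREAD",  # place each worker on a different node
    ),
)
result = trainer.fit()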
The quickstart (from the train.html page) shows activity on all the nodes now, but all of the loss/metric values are nan:
...
350/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
356/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
360/Unknown - 6s 11ms/step - loss: nan - mean_squared_error: nan
...
The mnist example doesn’t train. It just keeps printing its status every couple of seconds, though it does show RUNNING:
The cluster dashboard page does show a command line for every node (e.g. ray::TunerInternal.fit, ray::_RayTrainWorker__execute.get_next, ray::AutoscalingRequester, ray::TrainTrainable.train). I looked at the logs for them but didn’t notice anything obvious indicating why it’s doing nothing.
This behavior indicates that communication between the different workers is faulty or blocked. Usually this means that your Kubernetes nodes can’t communicate with each other on all ports.
The way Ray Train’s TensorflowTrainer (and ultimately all distributed TensorFlow training) works is that each worker opens a port on its node. This port, together with the node IP, is then used during distributed training to communicate gradients and model updates.
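As an illustration only (Ray Train wires this up internally; the addresses and port below are made up), each worker ends up with a TF_CONFIG along these lines, which tf.distribute.MultiWorkerMirroredStrategy uses to start its server and reach the other workers:

import json
import os

import tensorflow as tf

# Illustrative values: one "cluster.worker" entry per worker pod (IP:port),
# plus this worker's own index. The IPs and port 23456 are made up.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["10.1.215.52:23456", "10.1.98.17:23456"],
    },
    "task": {"type": "worker", "index": 0},
})

# The strategy reads TF_CONFIG and listens on the listed port, so every worker
# must be reachable from every other worker on that port.
strategy = tf.distribute.MultiWorkerMirroredStrategy()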
Can you relax your Kubernetes configuration to allow full communication between the pods?
Thank you. I found this: “By default, if no policies exist in a namespace, then all ingress and egress traffic is allowed to and from pods in that namespace”. I have not created any network policies myself; there are two HDFS-related network policies, but that’s it. So wouldn’t that mean full communication between the pods is already allowed? Just to make sure, I created a network policy that allows all ingress/egress on all pods and tried again, but the behavior of both examples is the same.
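For completeness, the allow-all policy I applied was equivalent to the following sketch, written here with the official kubernetes Python client; the policy name and namespace are placeholders rather than exactly what I applied:

from kubernetes import client, config

config.load_kube_config()

# An empty pod selector matches every pod in the namespace, and empty
# ingress/egress rules allow all traffic in both directions.
allow_all = client.V1NetworkPolicy(
    metadata=client.V1ObjectMeta(name="allow-all"),   # placeholder name
    spec=client.V1NetworkPolicySpec(
        pod_selector=client.V1LabelSelector(),
        policy_types=["Ingress", "Egress"],
        ingress=[client.V1NetworkPolicyIngressRule()],
        egress=[client.V1NetworkPolicyEgressRule()],
    ),
)

client.NetworkingV1Api().create_namespaced_network_policy(
    namespace="default",                              # placeholder namespace
    body=allow_all,
)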
Thank you for providing such detailed instructions. Using IP addresses, I confirmed that connectivity works between worker pods and from the worker pods to the head pod.
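(In case it helps anyone else, the IP-based check was essentially the following Python equivalent of the nc test shown below; port 1234 is just an arbitrary test port.)

import socket
import sys

# Cross-pod connectivity check on an arbitrary port (1234 here, as in the nc test).
# Run "python check.py listen" on one pod, then "python check.py connect <pod-ip>"
# from another pod.
PORT = 1234

if sys.argv[1] == "listen":
    server = socket.create_server(("0.0.0.0", PORT))
    conn, addr = server.accept()
    print("received", conn.recv(1024), "from", addr)
elif sys.argv[1] == "connect":
    with socket.create_connection((sys.argv[2], PORT), timeout=5) as sock:
        sock.sendall(b"hello")
        print("connected to", sys.argv[2])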
Attempting to use pod names doesn’t work:
$ echo hello | nc raycluster-autoscaler-head-gdsnd 1234
nc: getaddrinfo for host "raycluster-autoscaler-head-gdsnd" port 1234: Name or service not known
However, that appears to be by design: pod names aren’t resolvable directly in DNS. This k8s doc says that a name of the form 10-1-215-52.default.pod.cluster.local should resolve, and that does work: