Training issues with MultiWorkerMirroredStrategy

Ray Train's distributed TensorFlow integration with MultiWorkerMirroredStrategy is not deprecated, but users have reported issues running the official examples across multiple nodes, especially with recent TensorFlow and Keras versions. The error you encountered, "unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor", is a known TensorFlow issue with distributed datasets: it typically comes from how the data is sharded or how the dataset is constructed in a distributed context. This problem does not occur on a single node, which matches your experience. According to the Ray distributed TensorFlow guide, Ray should handle TF_CONFIG and worker setup for you, but compatibility issues with newer TensorFlow/Keras versions (e.g., Keras 3.x) can break these examples, as discussed in community threads.
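As a rough sketch of what the multi-node setup can look like when Ray Train provides TF_CONFIG, the example below creates the strategy and the dataset inside the per-worker training loop and pins the tf.data auto-shard policy to DATA, which is one common way to avoid sharding-related PerReplica conversion errors. The synthetic data, layer sizes, and worker count are placeholders, and it assumes a TensorFlow/Keras combination where MultiWorkerMirroredStrategy still works with model.fit (see the Keras 3.x workaround below).

```python
import tensorflow as tf
from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func(config):
    # Ray Train sets TF_CONFIG on each worker, so the strategy can be
    # created directly inside the per-worker training loop.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    def make_dataset(batch_size):
        # Placeholder synthetic data; replace with your real input pipeline.
        x = tf.random.uniform((1024, 10))
        y = tf.random.uniform((1024, 1))
        ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(batch_size)
        # Shard by DATA instead of relying on AUTO/FILE sharding, which is
        # a frequent source of distributed-dataset errors.
        options = tf.data.Options()
        options.experimental_distribute.auto_shard_policy = (
            tf.data.experimental.AutoShardPolicy.DATA
        )
        return ds.with_options(options)

    # Scale the per-replica batch size up to a global batch size.
    global_batch_size = 64 * strategy.num_replicas_in_sync
    dataset = make_dataset(global_batch_size)

    # Build and compile the model under the strategy scope.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(10,)),
            tf.keras.layers.Dense(16, activation="relu"),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    model.fit(dataset, epochs=2, verbose=0)


trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
```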

Additionally, distributed training with Ray Train and TensorFlow is only beneficial for large datasets, or when the speedup from scaling across multiple GPUs/nodes outweighs the communication overhead. For small datasets or single-node setups, distributed training can be slower or unnecessary, as noted in user discussions. If you are using recent TensorFlow/Keras versions, you may need to set os.environ["TF_USE_LEGACY_KERAS"] = "1" as a workaround (a sketch of how to apply it follows below), but this is not ideal for all users. Would you like a step-by-step breakdown of how to debug or adapt your code for multi-node distributed training with Ray and TensorFlow?
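If you do try the legacy-Keras workaround, note that the variable has to be set before TensorFlow is imported and has to reach the Ray worker processes, not just the driver. A minimal sketch, assuming the tf-keras package is installed and that passing environment variables through Ray's runtime_env is acceptable in your setup:

```python
import os

# Must be set before TensorFlow/Keras is imported anywhere in this process.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import ray

# Propagate the variable to every Ray worker process as well; otherwise the
# training workers may still pick up Keras 3.
ray.init(runtime_env={"env_vars": {"TF_USE_LEGACY_KERAS": "1"}})
```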

Hint: Mention @RunLLM in the post for followups.