1. Severity of the issue: (select one)
- None: I’m just curious or want clarification.
- Low: Annoying but doesn’t hinder my work.
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.51.1
- Python version: 3.13.5
- OS: Linux Mint and OpenSUSE
- Cloud/Infrastructure: Head node on a server, two worker nodes on desktops with GPUs
- Other libs/tools (if relevant): TensorFlow
3. What happened vs. what you expected:
- Expected: Able to run TensorFlow training with MultiWorkerMirroredStrategy
- Actual: Always crashes with "... shape=(64, 32, 32, 3), dtype=float32)>}) with an unsupported type (<class 'tensorflow.python.distribute.values.PerReplica'>) to a Tensor"
I have been trying to run distributed training with Ray Train and TensorFlow, using MultiWorkerMirroredStrategy so that training is efficiently distributed across two workers.
I have used these as examples:
But neither one runs on my cluster. I am not able to get either of them, or any other example using this strategy, to run successfully on two nodes; on a single node it does run.
Is this strategy somehow deprecated, or why do even the examples fail? I am also trying to understand whether I need it at all, since in my testing a single node is as fast or faster without it.
I am using a simple CNN on CIFAR-10 to learn Ray Train, and as mentioned I can’t even get the examples to run without crashing; a minimal sketch of what I’m running is below.
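For reference, here is a minimal sketch of the kind of script I am running. It follows the general structure of the Ray Train TensorFlow examples; the CNN, hyperparameters, and data pipeline are simplified placeholders rather than my exact code.

```python
import tensorflow as tf

from ray.train import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def train_func():
    # Ray Train sets TF_CONFIG on each worker, so the strategy can
    # discover the other workers and form the collective group.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # Plain tf.data pipeline; the batch size of 64 matches the
    # shape=(64, 32, 32, 3) batches seen in the error message.
    (x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
    dataset = (
        tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
        .shuffle(10_000)
        .batch(64)
    )

    # Model creation and compilation happen inside the strategy scope.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu",
                                   input_shape=(32, 32, 3)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )

    model.fit(dataset, epochs=3)


trainer = TensorflowTrainer(
    train_loop_per_worker=train_func,
    # Two workers, one per GPU desktop; the head node only coordinates.
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```

With num_workers=1 this kind of script runs to completion; with num_workers=2 it crashes with the PerReplica error quoted above.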
Any help would be extremely appreciated.