Hi,
I have a dataset stored in parquet files and I’m facing some issues:
- Using `ray.data.read_parquet` directly fails because Ray tries to build tensor columns, and my columns contain arrays with different numbers of dimensions (e.g., 1-channel grayscale images alongside 3-channel images). I don't have the option to modify the data. The error is:
  `ValueError: ArrowVariableShapedTensorArray only supports tensor elements that all have the same number of dimensions, but got tensor elements with dimensions: 1, 2`
  (A minimal reproduction is sketched after this list.)
- Using petastorm's `make_reader`, I create a DataLoader for each worker with its own shard by passing `shard_count` and `cur_shard` (see the second sketch below). The problem is that this doesn't guarantee an equal number of batches across workers: when one worker finishes before the others, the rest wait at the next sync point and hang. A related issue is that the number of batches per worker is unknown in advance, because whole row groups are assigned to workers and row-group sizes vary.
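For reference, this is roughly the Ray call that fails (the path is a placeholder):

```python
import ray

# Placeholder path; the real data is a set of parquet files.
# Ray tries to build a single tensor extension column for the image
# column, which fails because elements have different numbers of
# dimensions.
ds = ray.data.read_parquet("s3://my-bucket/dataset/")
# ValueError: ArrowVariableShapedTensorArray only supports tensor
# elements that all have the same number of dimensions, ...
```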
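And this is roughly the petastorm setup; `WORLD_SIZE`, `RANK`, the dataset URL, and the batch size are placeholders for values that come from my distributed training setup:

```python
from petastorm import make_reader
from petastorm.pytorch import DataLoader

# Placeholders; in practice these come from the distributed setup
# (e.g., torch.distributed rank and world size).
WORLD_SIZE = 4
RANK = 0

# Each worker reads only its own shard. Shards are made of whole row
# groups, so batch counts per worker are unknown and can differ.
with make_reader(
    "file:///data/dataset",  # placeholder dataset URL
    cur_shard=RANK,
    shard_count=WORLD_SIZE,
) as reader:
    loader = DataLoader(reader, batch_size=32)
    for batch in loader:
        ...  # training step; a worker that runs out of batches early
             # leaves the others hanging at the next collective sync
```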
Is there any way to deal with a different number of batches across workers? Or to handle the columns before the tensors are created? The sketch below shows what I mean by the latter.
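For context, this is the kind of thing I have in mind: a hypothetical sketch that reads the raw table with pyarrow and normalizes shapes myself before any tensor conversion happens (the file path and the `image` column name are placeholders):

```python
import numpy as np
import pyarrow.parquet as pq

# Hypothetical workaround sketch: read the raw table with pyarrow and
# normalize array shapes before anything builds tensors from them.
table = pq.read_table("/data/dataset/part-0.parquet")

images = []
for elem in table.column("image").to_pylist():
    arr = np.asarray(elem)
    if arr.ndim == 2:            # grayscale (H, W) -> (H, W, 1)
        arr = arr[..., np.newaxis]
    images.append(arr)           # now every element has the same ndim
```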