Hi,
I have a dataset stored in parquet files and I’m facing some issues:
- Using `ray.data.read_parquet` directly fails because Ray tries to build tensor columns, and my columns contain arrays with different numbers of dimensions (e.g., 1-channel grayscale images alongside 3-channel images). I don't have the option to modify the data. The error is:
  `ValueError: ArrowVariableShapedTensorArray only supports tensor elements that all have the same number of dimensions, but got tensor elements with dimensions: 1, 2`
  (A minimal reproduction is sketched after this list.)
- Using petastorm's `make_reader`, I create a DataLoader for each worker with its own shard by passing `shard_count` and `cur_shard` (see the second sketch below). The problem is that this doesn't guarantee an equal number of batches across workers: when one worker finishes before the others, the rest wait at the next sync point and hang. A related issue is that the number of batches per worker is unknown in advance, because whole row groups are assigned to workers and row-group sizes vary.
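For reference, this is roughly the Ray call that fails (the path is a placeholder):

```python
import ray

# Placeholder path; the real data is a set of parquet files.
# Ray tries to build a single tensor extension column for the image
# column, which fails because elements have different numbers of
# dimensions.
ds = ray.data.read_parquet("s3://my-bucket/dataset/")
# ValueError: ArrowVariableShapedTensorArray only supports tensor
# elements that all have the same number of dimensions, ...
```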
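And this is roughly the petastorm setup; `WORLD_SIZE`, `RANK`, the dataset URL, and the batch size are placeholders for values that come from my distributed training setup:

```python
from petastorm import make_reader
from petastorm.pytorch import DataLoader

# Placeholders; in practice these come from the distributed setup
# (e.g., torch.distributed rank and world size).
WORLD_SIZE = 4
RANK = 0

# Each worker reads only its own shard. Shards are made of whole row
# groups, so batch counts per worker are unknown and can differ.
with make_reader(
    "file:///data/dataset",  # placeholder dataset URL
    cur_shard=RANK,
    shard_count=WORLD_SIZE,
) as reader:
    loader = DataLoader(reader, batch_size=32)
    for batch in loader:
        ...  # training step; a worker that runs out of batches early
             # leaves the others hanging at the next collective sync
```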
Is there any way to deal with a different number of batches across workers? Or to handle the columns before the tensors are created? The sketch below shows what I mean by the latter.
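For context, this is the kind of thing I have in mind: a hypothetical sketch that reads the raw table with pyarrow and normalizes shapes myself before any tensor conversion happens (the file path and the `image` column name are placeholders):

```python
import numpy as np
import pyarrow.parquet as pq

# Hypothetical workaround sketch: read the raw table with pyarrow and
# normalize array shapes before anything builds tensors from them.
table = pq.read_table("/data/dataset/part-0.parquet")

images = []
for elem in table.column("image").to_pylist():
    arr = np.asarray(elem)
    if arr.ndim == 2:            # grayscale (H, W) -> (H, W, 1)
        arr = arr[..., np.newaxis]
    images.append(arr)           # now every element has the same ndim
```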