What is the correct way of using get_dataset_shard?

1. Severity of the issue: (select one)
Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: 2.49.0
  • Python version: 3.10
  • OS: Ubuntu 24.04
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

Where should I call get_dataset_shard in the training loop worker function: inside the epoch loop or outside it?

In the following example, get_dataset_shard is called once, before iterating over epochs:

def training_loop_per_worker(config):
    ...
    # === Get Data ===
    train_ds = get_dataset_shard("train")
    ...
    for epoch in range(config["train_epochs"]):
        ...
    ...
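
For context, here is a self-contained sketch of this first pattern with the elided pieces filled in by illustrative placeholders; iter_batches, batch_size=32, and the loop body are my assumptions, not part of the example above:

from ray import train

def training_loop_per_worker(config):
    # Fetch this worker's shard once, outside the epoch loop.
    train_ds = train.get_dataset_shard("train")
    for epoch in range(config["train_epochs"]):
        # Re-iterating the same DataIterator re-reads the shard each epoch.
        for batch in train_ds.iter_batches(batch_size=32):
            pass  # forward/backward pass for this batch would go here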

In another example, it is called inside the epoch loop:

def train_func(config):
    ...
    for epoch in range(config["epochs"]):
        ...
        train_dataset_shard = train.get_dataset_shard("train")
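
In either case, the name passed to get_dataset_shard has to match a key in the datasets dict given to the trainer. Below is a minimal end-to-end sketch of the driver side, assuming a TorchTrainer and a toy in-memory dataset; the dataset contents, num_workers=2, and train_epochs=2 are illustrative only:

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Toy dataset purely for illustration.
ds = ray.data.from_items([{"x": i, "y": 2 * i} for i in range(1000)])

trainer = TorchTrainer(
    training_loop_per_worker,              # either of the loop styles above
    train_loop_config={"train_epochs": 2},
    scaling_config=ScalingConfig(num_workers=2),
    datasets={"train": ds},                # key "train" is what get_dataset_shard("train") looks up
)
result = trainer.fit()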