Information on steps_per_epoch in distributed tensorflow

SumanthDatta · April 5, 2021, 4:14pm

Hi,

I am getting the following error:

ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed.We will handle the last partial batch in the future.

My training dataset has 43,846 records, the batch size is 128, and steps_per_epoch is 342.

This is my model:

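    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Sequential

    # Small binary classifier over 10 input features.
    model = Sequential()
    model.add(Dense(10, input_shape=(10,), activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])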

If I set num_replicas to a value greater than 1 inside TFTrainer, I see this error. Any idea how to solve this problem? Thanks in advance.

yic · April 5, 2021, 8:39pm

Hi there,
I'm a bit lost on this question. Could you provide more details, such as how you are using Ray here?

Thanks
Yi Cheng.

SumanthDatta · April 9, 2021, 4:25pm

@yic, I am using TFTrainer in Ray:

    TFTrainer(
        model_creator=self.createModel,
        data_creator=self.fetchValuesFromDatabase,
        num_replicas=2,
        num_cpus_per_worker=1,
        use_gpu=use_gpu,
        verbose=True,
        config={
            "batch_size": batch_size,
            "fit_config": {"steps_per_epoch": 342},
        },
    )

If the number of replicas is greater than 1, I get the error:

ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed.We will handle the last partial batch in the future .

steps_per_epoch is correct, as far as I can tell: it is (total number of records // batch size).

What am I doing wrong here?
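
For reference, here is a rough sketch of the arithmetic behind the numbers quoted above. It assumes the records are split evenly across workers and that each worker draws a full batch from its own shard on every step. That assumption is not confirmed anywhere in this thread, and the helper names `full_batches` and `per_worker_steps` are made up for illustration.

```python
def full_batches(num_records: int, batch_size: int) -> int:
    """Number of complete batches available before the data runs out."""
    return num_records // batch_size


def per_worker_steps(num_records: int, batch_size: int, num_workers: int) -> int:
    """Complete batches each worker can draw from its own shard.

    Assumption (not confirmed for TFTrainer): records are split evenly
    across workers and every worker draws `batch_size` records per step.
    """
    shard_size = num_records // num_workers  # smallest shard any worker gets
    return shard_size // batch_size


num_records, batch_size = 43_846, 128

single = full_batches(num_records, batch_size)
sharded = per_worker_steps(num_records, batch_size, num_workers=2)

print(single)   # 342
print(sharded)  # 171
```

Under that assumption, a steps_per_epoch of 342 would exhaust each worker's shard after 171 steps, which would be consistent with the EOF condition the error describes. Separately, calling `.repeat()` on the `tf.data.Dataset` returned by the data creator is a commonly used way to keep a dataset from running out before steps_per_epoch is reached; whether that resolves this particular case is untested here.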

SumanthDatta · April 15, 2021, 4:25pm

@sangcho, can you please help me with this question? It's a blocker for me. Any suggestions would be a great help.

sangcho · April 15, 2021, 5:39pm

cc @kai: I believe this is a Ray SGD question. Can you take a look at it?
