Information on steps_per_epoch in distributed TensorFlow

Hi,

I am getting the following error:

ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed.We will handle the last partial batch in the future.

My training dataset has 43,846 records, the batch size is 128, and steps_per_epoch is 342.
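
For reference, this is how I arrive at 342 (plain Python arithmetic, nothing Ray- or TF-specific):

    # Arithmetic behind the numbers above.
    total_records = 43_846
    batch_size = 128

    steps_per_epoch = total_records // batch_size  # 342 full batches, 70 records left over
    print(steps_per_epoch)  # 342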

This is my model:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(10, input_shape=(10,), activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

If I set num_replicas greater than 1 inside TFTrainer, I see this error. Any idea how to solve this problem? Thanks in advance.

Hi there,
I feel a bit lost on this question. Could you provide more details, like how you are using Ray here?

Thanks
Yi Cheng.

@yic, I am using TFTrainer in Ray:

    TFTrainer(
        model_creator=self.createModel,
        data_creator=self.fetchValuesFromDatabase,
        num_replicas=2,
        num_cpus_per_worker=1,
        use_gpu=use_gpu,
        verbose=True,
        config={
            "batch_size": batch_size,
            "fit_config": {
                "steps_per_epoch": 342,
            },
        },
    )

If the number of replicas is greater than 1, I get the error:

ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed.We will handle the last partial batch in the future.

steps_per_epoch is correct, that is (total number of records // batch_size).
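
The only other calculation I can think of is also dividing by the number of replicas, on my own assumption that the sharded dataset gives each of the 2 workers roughly half of the records:

    # Assumption on my side: with num_replicas=2 and a sharded dataset,
    # each worker only sees about half of the 43,846 records.
    total_records = 43_846
    batch_size = 128
    num_replicas = 2

    records_per_worker = total_records // num_replicas   # 21,923
    steps_per_worker = records_per_worker // batch_size  # 171 full batches per worker
    print(steps_per_worker)  # 171

Is 171 the kind of value the error expects, or is this not how the sharding works?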

What is it that I am doing wrong over here?

@sangcho, can you please help me with this question? It's a blocker for me. Any suggestion would be of great help.

cc @kai I believe this is a Ray SGD question? Can you take a look at it?