ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed. We will handle the last partial batch in the future.
My training dataset has 43,846 records. The batch size is 128 and steps_per_epoch is 342.
If the number of replicas is greater than 1, I get this error:
ValueError: When dataset is sharded across workers, please specify a reasonable steps_per_epoch such that all workers will train the same number of steps and each step can get data from dataset without EOF. This is required for allreduce to succeed. We will handle the last partial batch in the future.
steps_per_epoch is correct: it is (total number of records // batch size), i.e. 43,846 // 128 = 342.
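
To make the arithmetic concrete, here is a minimal sketch of how many full steps each worker could run before its shard reaches EOF. It is only an illustration under assumptions that are not confirmed above: that records are split evenly across workers, and that the batch size of 128 may be applied either globally (split across workers) or per worker. The helper name steps_available and the batch_is_global flag are hypothetical and not part of TensorFlow.

```python
def steps_available(num_records, batch_size, num_workers, batch_is_global):
    """Full steps a worker can run before its shard runs out of data.

    Assumption: records are sharded evenly, so the smallest shard holds
    num_records // num_workers records.
    """
    shard_records = num_records // num_workers
    # If the batch size is global, each worker consumes only its share of it.
    # If it is per worker, each worker consumes the whole batch_size per step.
    if batch_is_global:
        per_worker_batch = batch_size // num_workers
    else:
        per_worker_batch = batch_size
    return shard_records // per_worker_batch


num_records, batch_size = 43846, 128

for workers in (1, 2, 4):
    as_global = steps_available(num_records, batch_size, workers, True)
    as_per_worker = steps_available(num_records, batch_size, workers, False)
    print(workers, as_global, as_per_worker)
```

Under these assumptions, 342 steps fit when 128 is treated as the global batch size, but with 2 workers each batching 128 records from its own shard only 171 full steps are available, which is fewer than the requested 342. Whether this matches the actual setup depends on where the dataset is batched relative to sharding.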