I am using RaySGD TFTrainer
3 replicas & num_cpus_per_worker=1, batch_size=128
3 replicas & num_cpus_per_worker=3, batch_size=128
I am not seeing any significant improvement; both runs take the same time. What is the best way to make use of num_cpus_per_worker?
Another thing I observed:
1 replica & num_cpus_per_worker=1, batch_size=128
1 replica & num_cpus_per_worker=3, batch_size=128 (this takes more time than the first run, and only a single PID is in use — is that the expected behaviour??)
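For context on the single-PID part of my question: as far as I understand, each Ray worker is a single OS process, so even with num_cpus_per_worker=3 I would expect to see only one PID per replica, with any extra parallelism showing up as threads inside that process rather than as new PIDs. A minimal stdlib sketch (not Ray- or TF-specific) illustrating that multiple concurrent threads all report the same PID:

```python
import os
import threading

def worker(results, idx):
    # Each thread records the PID of the process it runs in.
    results[idx] = os.getpid()

results = {}
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All three threads ran inside the same process as the main thread.
assert all(pid == os.getpid() for pid in results.values())
print(results)
```

So if that understanding is right, seeing one PID is not itself a problem — my question is whether the worker is actually using the extra reserved CPUs via threads.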
Env:
Ray 1.0.0
Python 3.8
TF 2.4.1
4-node cluster, 6 cores each