[RaySGD] How to best utilise num_cpus_per_worker?

I am using the RaySGD TFTrainer.

3 replicas, num_cpus_per_worker=1, batch_size=128
3 replicas, num_cpus_per_worker=3, batch_size=128

I am not seeing any significant improvement; both runs take the same time. How can we best utilise num_cpus_per_worker?

Another thing I observed:

1 replica, num_cpus_per_worker=1, batch_size=128
1 replica, num_cpus_per_worker=3, batch_size=128 (this takes more time than the first run, and only a single PID is used; is this the right behaviour?)

Env:
Ray v1.0.0
Python 3.8
TF 2.4.1
4-node cluster, 6 cores each

num_cpus_per_worker is just a resource specification for Ray's scheduler – on its own it won't change anything about how the training code runs.

You should make sure num_cpus_per_worker matches the parallelism of your input pipeline, e.g. num_cpus_per_worker == DataLoader(num_workers=...) in PyTorch, or the num_parallel_calls you pass to tf.data transformations.
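The point can be sketched in plain Python, with no Ray dependency (`NUM_CPUS_PER_WORKER`, `preprocess`, and `load_batch` are hypothetical stand-ins, not RaySGD API): the CPUs that num_cpus_per_worker reserves only pay off if the worker's input pipeline actually fans work out across that many parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the value passed to the trainer as
# num_cpus_per_worker. Ray only *reserves* this many CPUs per worker;
# it is up to the worker's code to actually use them.
NUM_CPUS_PER_WORKER = 3

def preprocess(sample):
    # Stand-in for per-sample preprocessing (decode, augment, ...).
    return sample * 2

def load_batch(samples):
    # Fan preprocessing out across exactly as many workers as Ray
    # reserved. Fewer workers waste the reservation (the extra CPUs
    # sit idle); more would oversubscribe the node.
    with ThreadPoolExecutor(max_workers=NUM_CPUS_PER_WORKER) as pool:
        return list(pool.map(preprocess, samples))

print(load_batch([1, 2, 3, 4]))  # → [2, 4, 6, 8]
```

If the pipeline stays single-threaded (as in the 1-replica observation above, where only one PID was busy), raising num_cpus_per_worker just reserves cores that sit idle, which is why no speed-up is seen.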