It is not shocking that at a certain point there is not enough work left to split (and there is a fixed cost to managing multiple workers). So multiplying the number of CPUs by 13 will not divide the running time by 13. Still, I'm quite curious to understand how parallelization actually works in a multi-CPU environment.
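Just to make that intuition concrete, here is a toy Amdahl-style estimate (the 10% serial/overhead fraction is an arbitrary number I picked, not something measured):

# Toy Amdahl-style estimate, nothing RLlib-specific: a fraction of the work
# stays serial (coordination, batch assembly), the rest splits evenly.
def speedup(n_cpus, serial_fraction=0.10):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

print(speedup(13))  # ~5.9x, far from the naive 13x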
The settings @Abderraim is citing look like:
tune.run(
    config={
        # ...
        # Parallel training CPUs
        'num_cpus_for_driver': 13,
        'tf_session_args': {
            'intra_op_parallelism_threads': 0,
            'inter_op_parallelism_threads': 0,
            'device_count': {
                'CPU': 13,
            },
        },
        'local_tf_session_args': {
            'intra_op_parallelism_threads': 0,
            'inter_op_parallelism_threads': 0,
        },
        # ...
    },
)
Since we specify the number of CPUs available in the TensorFlow session, I understand that the parallelization happens at the TensorFlow level. Can anyone confirm that?
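If that is right, then I assume (this is my guess at what happens under the hood, not something I traced in the RLlib source) that tf_session_args is essentially forwarded into the TF session configuration, i.e. roughly:

import tensorflow as tf

# My assumption of how tf_session_args maps onto the session config;
# these are standard TF1-style ConfigProto fields.
config_proto = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=0,   # 0 = let TF pick the thread count
    inter_op_parallelism_threads=0,
    device_count={"CPU": 13},
)
session = tf.compat.v1.Session(config=config_proto)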
What I do not understand is how the SGD itself is performed in this case. The minibatches seem to be entirely managed by RLlib:
# sgd.py
def do_minibatch_sgd(samples, policies, local_worker, num_sgd_iter,
                     sgd_minibatch_size, standardize_fields):
    """Execute minibatch SGD."""
    # ...
    for policy_id in policies.keys():
        # ...
        for i in range(num_sgd_iter):
            iter_extra_fetches = defaultdict(list)
            for minibatch in minibatches(batch, sgd_minibatch_size):
                batch_fetches = (local_worker.learn_on_batch(
                    MultiAgentBatch({
                        policy_id: minibatch
                    }, minibatch.count)))[policy_id]
                # ...
    return fetches
So I understand that whatever parallelization happens, it happens at the minibatch level. Is that right?
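For what it is worth, this is how I picture the minibatches() helper behaving (a sketch of my reading, not the actual RLlib implementation):

# Sketch of my understanding of minibatches(), not the real code:
# walk the collected train batch in windows of sgd_minibatch_size rows;
# each window is then wrapped in a MultiAgentBatch and passed to
# local_worker.learn_on_batch() for a single gradient step.
def minibatches_sketch(batch, sgd_minibatch_size):
    for start in range(0, batch.count, sgd_minibatch_size):
        yield batch.slice(start, start + sgd_minibatch_size)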
Lastly, as Abderrahim asks, it puzzles me how the forward/backward passes are actually performed. I don't see a TensorFlow distribution strategy declared anywhere. Can we control that from RLlib?
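To be explicit about what I expected to see somewhere in the stack, I mean something like the standard tf.distribute API (a generic TensorFlow snippet, not RLlib code):

import tensorflow as tf

# A plain tf.distribute example of the kind of declaration I was looking
# for: the model and optimizer are created inside the strategy scope so
# that gradients are computed and averaged across the replicas.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")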