How does multi-CPU work within Ray?

Hello,

I am training a PPO agent on a 16-CPU cluster on Azure with a custom environment.
If I understand correctly, there is a balance to be found between the number of CPUs assigned to the workers for the sampling task (num_worker * num_cpu_per_worker) and the number of CPUs assigned to the learning task (num_cpu_for_driver).
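
To make the trade-off concrete, a hypothetical split of the 16 CPUs could look like this (illustrative numbers only, not my actual settings):

# Hypothetical 16-CPU budget (illustrative values)
config = {
    'num_workers': 3,            # rollout workers doing the sampling
    'num_cpus_per_worker': 1,    # 3 CPUs in total for sampling
    'num_cpus_for_driver': 13,   # CPUs reserved for the learner/driver
}
# 3 * 1 + 13 = 16 CPUs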

However, with num_cpu_for_driver=1 I get a mean time per iteration of 5.5s, and increasing the number of CPUs to 13 only brings the mean time per iteration down to approximately 2s.

So, how does multi-CPU work in Ray?
Are the minibatches back-propagated on only one CPU, or are they dispatched across the num_cpu_for_driver CPUs?
Is it strange that the learning time doesn't improve that much when going from 1 to 13 CPUs?

Thanks,
Abderrahim

cc @kai Can you address the question?

Actually, let me cc @sven1977 for questions around RLlib algorithms.

Generally, I think the driver can leverage multiple threads only to a certain degree, and it might also be limited by the number of incoming worker batches. But I'm not familiar enough with the PPO implementation to know whether this is the case here or whether we should always have enough data to train on.

It is not shocking that at a certain point there is not enough work left to split (and there is a fixed cost to managing multiple workers). So multiplying the number of CPUs by 13 does not mean your running time will be divided by 13. Still, I'm quite curious to understand how parallelization works in a multi-CPU setting.
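
As a rough back-of-the-envelope illustration (my own sketch, not a measurement of RLlib): if a fraction of each iteration is inherently serial, Amdahl's law caps the achievable speedup, which would be consistent with going from 5.5s to roughly 2s (about 2.75x) with 13 CPUs:

# Back-of-the-envelope Amdahl's law sketch (illustrative only)
def amdahl_speedup(serial_fraction, n_cpus):
    # Upper bound on speedup when only (1 - serial_fraction) of the work parallelizes.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cpus)

# A serial fraction of ~0.3 already caps 13 CPUs at roughly the observed ~2.75x:
print(amdahl_speedup(0.3, 13))  # ~2.8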

The settings @Abderrahim is citing look like:

tune.run(
    config={
        # ...
        # CPUs for parallel training
        'num_cpus_for_driver': 13,
        'tf_session_args': {
            'intra_op_parallelism_threads': 0,
            'inter_op_parallelism_threads': 0,
            'device_count': {
                'CPU': 13,
            },
        },
        'local_tf_session_args': {
            'intra_op_parallelism_threads': 0,
            'inter_op_parallelism_threads': 0,
        },
        # ...
    },
)

Thus, since we specify the number of CPUs available in the TensorFlow session, I understand that the parallelization happens at the TensorFlow level. Can anyone confirm that?
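
For what it's worth, my understanding is that these keys are simply forwarded into the TensorFlow session configuration, roughly equivalent to the following plain-TF sketch (illustrative; RLlib builds the session internally):

# Sketch of what tf_session_args roughly maps to in plain TensorFlow 1.x-style code
import tensorflow as tf

session_config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=0,   # 0 = let TF choose; threads used inside a single op
    inter_op_parallelism_threads=0,   # 0 = let TF choose; threads across independent ops
    device_count={'CPU': 13},
)
sess = tf.compat.v1.Session(config=session_config)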

What I do not understand is how SGD is performed in this case. The minibatches seem to be entirely managed by RLlib:


# sgd.py

def do_minibatch_sgd(samples, policies, local_worker, num_sgd_iter,
                     sgd_minibatch_size, standardize_fields):
    """Execute minibatch SGD."""
    # ...
    for policy_id in policies.keys():
        # ...
        for i in range(num_sgd_iter):
            iter_extra_fetches = defaultdict(list)
            # The minibatches are iterated over sequentially and each one is
            # handed to the local worker for a single SGD step.
            for minibatch in minibatches(batch, sgd_minibatch_size):
                batch_fetches = (local_worker.learn_on_batch(
                    MultiAgentBatch({
                        policy_id: minibatch
                    }, minibatch.count)))[policy_id]
                # ...
    return fetches

Thus, I understand that whatever parallelization is being done, it is done at the minibatch level. Is that right?
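
For context, my reading of minibatches() is that it just slices the large train batch into consecutive chunks that are then fed to learn_on_batch() one after another, something like this simplified sketch (not the actual RLlib code, which works on SampleBatch objects and handles shuffling and sequence boundaries):

# Simplified, hypothetical version of a minibatch splitter (illustrative only)
def simple_minibatches(batch, minibatch_size):
    for start in range(0, len(batch), minibatch_size):
        yield batch[start:start + minibatch_size]

# Each chunk is passed to learn_on_batch() sequentially,
# so the SGD loop itself is not parallelized at this level.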

Lastly, as Abderrahim asks, it puzzles me how the forward/backward passes are actually performed. I don't see a TensorFlow distribution strategy declared anywhere. Can we control this from RLlib?

Hey @GattiPinheiro, a multi-CPU setup is currently not supported by RLlib, at least not in the data-parallel way we have for multi-GPU (splitting the batch, passing it through parallel towers, and averaging over the resulting gradients). The intra_op_parallelism_threads and inter_op_parallelism_threads settings "only" help with e.g. parallelizing matrix multiplications within your layers (which may speed things up a little) or parallelizing between independent ops (which you usually don't have in a feed-forward NN).
Our upper-level minibatching and RL algo learning patterns are unaffected by these settings.
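
To illustrate what I mean by data-parallel (a toy numpy sketch, not RLlib's actual multi-GPU code):

# Toy sketch of data-parallel gradient averaging (illustrative only)
import numpy as np

def grad_mse_linear(w, x, y):
    # Gradient of the mean squared error for a linear model y_hat = x @ w.
    return 2.0 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x, y = rng.normal(size=(8, 3)), rng.normal(size=8)

# Each "tower" gets half of the batch and computes its own gradient.
g1 = grad_mse_linear(w, x[:4], y[:4])
g2 = grad_mse_linear(w, x[4:], y[4:])

# Averaging the tower gradients recovers the full-batch gradient.
assert np.allclose((g1 + g2) / 2, grad_mse_linear(w, x, y))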
