How many workers? Best way to determine number of workers?

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I am running trials on an Intel Core i7-6700K CPU with 4 cores and 2 threads per core, and an NVIDIA GTX 960 (which I think has 1024 CUDA cores). I am currently using 6 workers.

(a) How many workers do you think this set up could support for a PPO algorithm?

(b) What is the best way to determine the number of workers for a setup such as this?

(c) I was reading a paper that used “two Intel Xeons with 40 cores” with RLlib but failed to mention how many workers they used. Would I be wrong in assuming that they had 40 workers (num_workers = 39)? I feel like this is wrong because Getting Started with RLlib — Ray 2.2.0 suggests that more than 10 workers is a lot.

(a) and (b): That depends on a number of factors, such as how fast your workers sample and how long one iteration of training takes. There is no general formula for this. Generally, you'll want to scale up workers until your GPU waits as little as possible for the next batch and becomes the bottleneck. You can grid_search over the number of workers with Tune and look at the training iteration timers.
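To build intuition for "scale up until the GPU stops waiting", here is a toy back-of-the-envelope model (not an RLlib API; all numbers are illustrative assumptions) that estimates what fraction of a training iteration the GPU sits idle waiting for samples, for various worker counts:

```python
# Toy model: fraction of an iteration the GPU spends waiting for the next
# batch. Assumed throughput/timing numbers are illustrative only.
def gpu_idle_fraction(num_workers, env_steps_per_sec_per_worker=200,
                      train_batch_size=4000, train_time_s=1.5):
    """Estimate GPU idle fraction per iteration under this toy model."""
    # Time the workers need to collect one training batch.
    sample_time = train_batch_size / (num_workers * env_steps_per_sec_per_worker)
    # GPU is idle whenever sampling takes longer than training.
    idle = max(0.0, sample_time - train_time_s)
    # Iteration wall time ~= idle + train time (= sample_time when idle > 0).
    return idle / (idle + train_time_s)

for w in (2, 4, 6, 8, 12):
    print(w, round(gpu_idle_fraction(w), 3))
```

Under these assumed numbers, the idle fraction drops as workers are added until sampling keeps pace with training; past that point, extra workers buy nothing, which is exactly what the iteration timers in Tune would show you.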
(c) Some users need 100 workers; others need even more. All of this is relative, and the best number is different for each case.


@arturn Happy New Year!

Thank you for your response.

I am still a little confused about some things, and I think it is because I am having a hard time picturing how a training algorithm uses both the CPU and the GPU.

(i) On a machine with only one CPU with 8 threads, could I have more workers than threads, and would that ever make sense? For example, could I set num_workers=9 even though I have 8 threads?

(ii) Does the GPU run worker processes on its own, or does it work together with a CPU thread to perform the parallel parts of the algorithm? For example, if I had a CPU with 8 threads and a GPU with 1024 CUDA cores and I set num_workers = 40, would a core worker + 7 workers be placed on the 8 CPU threads and the other 33 workers on the GPU? This is the main question that is throwing me off: I don’t know how, or whether, a GPU can run worker processes.

(i) You cannot have more workers than CPUs, because each worker has a main thread.
An actor can, however, open additional threads. You don’t set the number of threads anywhere, only the number of workers, and each of these lives in its own process with its own main thread. So normally num_threads >= num_workers + 1 (the +1 being the driver thread). It would probably be helpful to read a little about Ray Actors if you want to fully understand this, because RolloutWorkers are Ray Actors.
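As a quick sketch of the rule of thumb above (this is plain Python, not an RLlib API): reserve one logical CPU for the driver/trainer and hand the rest to rollout workers.

```python
import os

# num_threads >= num_workers + 1: keep one logical CPU for the driver,
# use the remaining ones for rollout workers.
logical_cpus = os.cpu_count() or 1
num_workers = max(1, logical_cpus - 1)  # e.g. 7 workers on an 8-thread CPU
print(num_workers)
```

On your 4-core / 8-thread i7-6700K this would suggest 7 workers as an upper bound, which is consistent with the 6 you are running now.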

(ii) Each worker needs a CPU to run on. You can also give workers GPUs as additional resources (in almost every case to speed up operations related to ANNs), and a GPU can be shared between workers. Again, it would be helpful to read about Ray Actors here.
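For concreteness, here is a hypothetical config fragment showing how one physical GPU can be shared via fractional allocations (Ray 2.2-era RLlib config keys; check the docs for your version, and treat the exact fractions as assumptions):

```python
# Hypothetical RLlib config fragment: split one physical GPU between the
# trainer process and the rollout workers via fractional GPU requests.
config = {
    "num_workers": 6,
    "num_gpus": 0.5,              # trainer process's share of the GPU
    "num_gpus_per_worker": 0.05,  # each rollout worker's share
}
# Total requested: 0.5 + 6 * 0.05 = 0.8 of one GPU, so it fits on one card.
```

Note that the workers still each occupy a CPU; the fractional GPU shares only let their neural-network operations run on the card alongside the trainer's.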
