Most efficient way to use only a CPU for training

Hi all. I’m running Ray on a machine with 40 CPU cores and 2 GPUs, and I want to reduce CPU usage to only one core (in order to leave the others free to run other programs). I tried setting num_cpus=1 when initializing Ray (ray.init(num_cpus=1)), but the result was that, although I set 0 GPUs for the workers and the driver in my PPO agent, all 40 cores of my machine were used during training. So my question is whether there is any way to keep Ray from using more CPUs than specified, for example by hiding those resources.

I also tried to force the process where I use Ray to run on only one CPU (by calling os.sched_setaffinity(0, {0}) in the same script). By doing that I apparently achieved what I wanted: 39 of the 40 CPUs were left free. The problem is that, when I examined the metrics obtained this way with an 8-worker PPO agent, the sample time of each training iteration was very high compared with the normal case where all CPUs are used. So I want to know how Ray creates threads, in order to tell whether this sample-time overhead comes from having 8 workers running on a single CPU, or whether the cause is that Ray still considers 40 CPUs to be available and schedules tasks for those resources, only to find that all of them end up running on the same CPU.
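For reference, this is roughly what I did (just a sketch; on Linux the affinity set here is inherited by the Ray processes started afterwards):

```python
import os
import ray

# Pin this script to CPU core 0 before starting Ray; the Ray processes
# it launches inherit this affinity (Linux only).
os.sched_setaffinity(0, {0})

ray.init(num_cpus=1)
```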

So my questions are:

  • Is there any other way to force Ray to run on only one CPU of a multi-CPU system?
  • How does Ray create threads and tasks for RLlib: based on the available resources, or based on the number of workers specified?
  • Is it possible that the underlying TensorFlow framework is responsible for using all 40 CPUs, even when Ray is initialized to use only one of them? In that case, is there a better way to limit it than using os.sched_setaffinity? (See the sketch just below this list.)
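For the last question, the kind of limit I have in mind is something like TensorFlow’s own threading settings (just a sketch; I haven’t checked whether RLlib’s session configuration overrides this):

```python
import tensorflow as tf

# Restrict TensorFlow to one thread within each op and one thread for
# running ops in parallel (TF 2.x API; must be set before TF creates
# its thread pools).
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)
```

If RLlib builds its own TF sessions, I suppose the equivalent knobs would be the intra/inter op thread settings in the trainer’s tf_session_args / local_tf_session_args config, but I’m not sure about that.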

Thanks so much in advance!

Hey @javigm98, great question!
a) If you are running RLlib through tune.run() and initialize via ray.init(num_cpus=1) and have num_workers>0, you should see tune blocking the run due to lack of (CPU) resources. For normal algos, you need num_workers+1 CPUs (1 for the driver/local-worker process).
b) If you run “directly”, using an RLlib Trainer instance and calling .train() on this instance repeatedly, then only num_workers CPUs should be required out of the pool you declared in ray.init().
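To illustrate (a), here is a minimal sketch (assuming the pre-2.0 tune/RLlib API): with num_cpus=1 declared but num_workers=2, each PPO trial asks for 3 CPUs (2 rollout workers + 1 driver), so Tune keeps the trial pending and reports that it is waiting for resources.

```python
import ray
from ray import tune

ray.init(num_cpus=1)

# Each PPO trial requests num_workers + 1 = 3 CPUs, but only 1 CPU was
# declared above, so Tune blocks the trial while waiting for resources.
tune.run(
    "PPO",
    config={"env": "CartPole-v0", "num_workers": 2, "num_gpus": 0},
    stop={"training_iteration": 1},
)
```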

Also note that you can specify more CPUs than your machine actually has in ray.init(), which will lead to some workers sharing a CPU (and thus not running truly in parallel).

Hi @sven1977, so if what I want is to leave most of my machine’s CPUs free, is it a good idea to configure the trainer with only two workers and initialize Ray with num_cpus=3? Will this run the processes on only 3 of the CPUs and leave the other 37 completely free? (I mean when calling agent.train().)

Yes, it should. If you say ray.init(num_cpus=3), no more than 3 actual CPUs should be used.
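A minimal sketch of that setup (using the pre-2.0 PPOTrainer API; class and config key names may differ in newer Ray versions):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

# Declare only 3 CPUs to Ray: 2 for the rollout workers + 1 for the driver.
ray.init(num_cpus=3)

trainer = PPOTrainer(
    env="CartPole-v0",
    config={"num_workers": 2, "num_gpus": 0, "num_gpus_per_worker": 0},
)

for _ in range(10):
    result = trainer.train()
    print(result["episode_reward_mean"])
```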