Hi all. I’m running Ray on a machine with 40 CPU cores and 2 GPUs, and I want to restrict Ray to a single core so that the other cores stay free for other programs. I tried setting num_cpus=1 when initializing Ray (ray.init(num_cpus=1)), but the result was that, although I set 0 GPUs for the workers and the driver in my PPO agent, all 40 cores of the machine were used while training. So my question is whether there is any way to keep Ray from using more CPUs than specified, e.g. by hiding those resources from it.
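For reference, my setup looks roughly like this (the env name is just a placeholder; num_workers, num_gpus, and num_cpus are the values described above, using the old ray.rllib.agents API):

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init(num_cpus=1)  # intended to cap Ray at one CPU

config = {
    "num_workers": 8,          # 8 rollout workers
    "num_gpus": 0,             # no GPU for the driver/learner
    "num_gpus_per_worker": 0,  # no GPU for the workers
    "framework": "tf",
}
# trainer = PPOTrainer(config=config, env="<my-env>")  # placeholder env id
# trainer.train()
```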
I then tried to force the process running Ray to use only one CPU by calling os.sched_setaffinity(0, {0}) in the same script. By doing that I apparently achieved what I wanted: 39 of the 40 CPUs were freed. But here is the key point: when I examined the metrics obtained this way with an 8-worker PPO agent, the sample time of each training iteration was very high compared to the normal case where all CPUs were used. So I would like to know how Ray creates threads, in order to tell whether this sample-time overhead comes from having 8 workers running on a single CPU, or whether the cause is that Ray still considers 40 CPUs to be available and schedules tasks assuming those resources, only to find that they all have to execute on the same CPU.
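Concretely, this is the pinning call I added near the top of the training script (pid 0 refers to the current process; core ids are zero-based):

```python
import os

# Restrict this process (0 = the current pid) to CPU core 0 only.
os.sched_setaffinity(0, {0})

# Verify: the process may now be scheduled on core 0 only.
print(os.sched_getaffinity(0))  # -> {0}
```

As far as I understand, processes started afterwards inherit this affinity mask, which would explain why all the Ray workers ended up sharing that one core.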
So my questions are:
- Is there any other way to force Ray to run on only one CPU in a multi-CPU system?
- How does Ray create threads and tasks for RLlib: based on the available resources, or based on the number of workers specified?
- Is it possible that the underlying TensorFlow framework is responsible for using all 40 CPUs, even when Ray is initialized to use only one of them? If so, is there any better way to limit it than using os.sched_setaffinity?
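In case it matters, this is what I was planning to try next to cap TensorFlow’s own thread pools; I am not sure these are the right knobs, so please correct me if not (my understanding is that the environment variables must be set before TensorFlow is imported):

```python
import os

# Assumption on my part: these limit TF's OpenMP / intra-op / inter-op pools.
os.environ["OMP_NUM_THREADS"] = "1"         # OpenMP/MKL worker threads
os.environ["TF_NUM_INTRAOP_THREADS"] = "1"  # threads used inside a single op
os.environ["TF_NUM_INTEROP_THREADS"] = "1"  # ops executed concurrently

# import tensorflow as tf  # must come after the variables above are set
# Alternatively, with the TF 2.x API:
# tf.config.threading.set_intra_op_parallelism_threads(1)
# tf.config.threading.set_inter_op_parallelism_threads(1)
```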
Thanks so much in advance!