Training and inference ONLY using GPUs and no CPUs

Hi, I’m having a problem when using RLlib with GPUs. I want to evaluate training and rollout performance using a GPU, but I want to train or evaluate the model ONLY on the GPU, not on a combination of GPU and CPU. I have tried to set the config keys num_gpus_for_driver and num_gpus_per_worker to 0, but this seems to have no effect. So, my question is: Is there any way to guarantee that RLlib experiments run ONLY on a GPU and avoid simultaneous use of GPU and CPU?

P.S. I’m running my experiments with the Pong-v0 environment and using PPO.

Ray version 1.1.0

Thanks in advance

I have also tried to fix it by setting num_cpus to 0 when initializing ray (ray.init(num_cpus=0)), but this also seems to have no effect…

num_gpus_per_worker sets the GPUs for rollout workers (e.g. if your environments need a GPU to function), not for the learner process. To have a GPU learner, you must set num_gpus in the config:

   "num_gpus": 0.5

Hi @smorad and thanks for your answer. You are right, but my question was focused on avoiding CPU usage in rollouts and training. I mean, when I set, for example, num_gpus=0.0001, num_gpus_per_worker=0.1, num_cpus_for_driver=0 and num_cpus_per_worker=0, what I expect is that all operations run ONLY on the GPU of my system (because I want to evaluate performance using only CPUs and using only GPUs). However, when I set up my agent with the resources I mentioned, what I see is that even if I set num_workers=8, the processes run on all 40 cores of the system I’m using.

So my question was more about whether there is a way to run RLlib training and rollout processes and operations only on GPUs, with no CPU usage.

@javigm98 it is not possible to have all operations run on the gpu. There are lots of parts of ray and rllib, the majority in fact, that can only run on the cpu. The only parts that you can provision to run on the gpus are the tf/torch parts.

There are two parts to selecting the number of cpus used. First, when you start ray there is an option num_cpus. If you do not provide one, the default is the total count of cpus on the system. You can specify fewer to override that number. If you specify 0 it will either not run or change it to 1; I am not sure which. The second part is when you ask ray to schedule an rllib trainer. This is where the options you mentioned in your post factor in.
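The second part can be sketched as simple arithmetic. The helper below is hypothetical (requested_resources is not an RLlib function) and assumes that a trainer’s request is the driver’s share plus the per-worker share multiplied by num_workers. It uses the values from the post above.

```python
def requested_resources(config):
    """Sum the logical CPUs and GPUs one trainer would ask Ray to reserve.

    Hypothetical helper for illustration; assumes the request is
    driver share + num_workers * per-worker share.
    """
    workers = config["num_workers"]
    cpus = config["num_cpus_for_driver"] + workers * config["num_cpus_per_worker"]
    gpus = config["num_gpus"] + workers * config["num_gpus_per_worker"]
    return {"CPU": cpus, "GPU": gpus}


# Values taken from the question earlier in the thread.
config = {
    "num_workers": 8,
    "num_cpus_for_driver": 0,
    "num_cpus_per_worker": 0,
    "num_gpus": 0.0001,
    "num_gpus_per_worker": 0.1,
}

totals = requested_resources(config)
print(totals)
```

Under that assumption the request reserves zero logical CPUs, which only affects Ray’s bookkeeping and says nothing about how many cores the processes will actually keep busy.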

Thanks for your answer @mannyv, but the key point is that even when I set num_cpus_per_worker=0 and num_cpus_for_driver=0 in the agent’s config, ray continues scheduling worker and driver jobs on the CPUs, and the CPU usage percentage is the same as when I set these two values to 1, for example. So what I really wanted to know is whether there is any way to schedule learning tasks off the CPUs. I know that there are some agent config params such as tf_session_args={"device_count":{"CPU":1}}, or the same with local_tf_session_args, but I really don’t know if changing these makes sense in my case.

Hi @javigm98,

As an rllib end user you can think of ray as an asynchronous distributed job scheduler. Think of it like this: ray is running a number of schedulers and code executors on one machine or on a whole cluster of machines. You as an end user have some code that you want ray to run for you. Ray’s job is to figure out where it should run based on a set of resource requirements that you provide (through the config). Here is a key part: once ray starts a job, it has very little to no control over how many resources that job is going to consume. You could ask it to run a task that sleeps the whole time, and it would use virtually no resources (cpu or gpu) at all. On the other hand, you could write a piece of code that determines the total number of physical resources and uses them all for almost the whole time.

So if ray has little control over what is actually running once it starts, then what is the point of all the configuration parameters specifying cpus and gpus? This is where its job as a scheduler comes into play. When you start ray, either you tell it the resources available to it (number of cpus, gpus, memory, special licenses, …) or, if you don’t, it uses some pre-written rules to determine them automatically. Now ray is up and running and knows how many resources it is managing.

When you ask ray to train an RL algorithm with rllib, you are scheduling jobs for ray to run. To make sure that it does not allocate more jobs than can be handled at one time, it needs to know the resource requirements of the job you want to run. Once it knows that, it can determine whether it can start the job now or has to wait until already running tasks complete so that it has the required resources.

Let’s say you started ray with 4 cpus and you have a running task that told ray it will use 2 of them. If you try to start a second job that needs 3, ray will accept it, but the job cannot run yet because it needs 3 cpus and only 2 are currently free. It will wait until the first task stops so that it can allocate 1 of the freed cpus. When you tell ray that an rllib job needs 0 cpus, you are saying that it can run as many of those jobs as it wants. You could start 100 rllib trainings at the same time, but they would surely fail by running out of memory at some point and would probably run extremely slowly before that happens. That happens because you did not give ray a realistic set of requirements for the jobs you actually ran, so it oversubscribed the system.
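The bookkeeping in this example can be imitated with a toy model. This is only a sketch of the idea and not how ray is implemented; ToyScheduler and its methods are invented for illustration.

```python
class ToyScheduler:
    """Toy model: admit a job only if its declared CPUs fit in what is free."""

    def __init__(self, total_cpus):
        self.total_cpus = total_cpus
        self.running = {}  # job name -> declared CPUs

    def free_cpus(self):
        return self.total_cpus - sum(self.running.values())

    def submit(self, name, declared_cpus):
        # Only the declared number is checked, never the real usage.
        if declared_cpus <= self.free_cpus():
            self.running[name] = declared_cpus
            return "running"
        return "pending"

    def finish(self, name):
        self.running.pop(name)


sched = ToyScheduler(total_cpus=4)
first = sched.submit("job_a", 2)   # fits: 2 of 4 cpus are now reserved
second = sched.submit("job_b", 3)  # does not fit: only 2 cpus are free
sched.finish("job_a")              # job_a releases its 2 cpus
retry = sched.submit("job_b", 3)   # fits now: 4 cpus are free

# Jobs that declare 0 cpus are always admitted, however many there are.
zero = ToyScheduler(total_cpus=4)
states = [zero.submit(f"trial_{i}", 0) for i in range(100)]
print(first, second, retry, set(states), zero.free_cpus())
```

All 100 zero-cpu trials are admitted and the toy scheduler still believes 4 cpus are free, which is the oversubscription described above.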

When you tell ray the driver requires 0 cpus you are NOT imposing a constraint on the driver. The driver is going to use however much of the cpu it needs to execute the instructions required by its implementation. You are merely telling ray how much the driver will use so that ray knows to set that amount aside when asked to run other things.
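A small plain-Python illustration of that gap between the declared and the real usage, with no ray involved. The names DECLARED_CPUS, busy and driver_like_task are made up for this sketch.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# What we would tell a scheduler. Nothing below reads or enforces it.
DECLARED_CPUS = 0


def busy(n):
    """Some CPU-bound work."""
    return sum(i * i for i in range(n))


def driver_like_task():
    # The task sizes its thread pool from the machine, not from the declaration.
    n_threads = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(busy, [10_000] * n_threads))
    return n_threads, results


threads_used, results = driver_like_task()
print(f"declared {DECLARED_CPUS} cpus, started {threads_used} threads")
```

In CPython these threads share the interpreter lock, so the sketch only shows the sizing decision. Native libraries such as TensorFlow typically size their own thread pools from the core count in a similar way, and those threads do run on several cores at once.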

Both inference and, more importantly, training of a neural network can be very compute intensive. In ray and rllib, the neural network is the part of the system that benefits from the specialized capabilities of the gpu. Many other parts of rllib do not: running the environment; collecting, storing and retrieving samples; managing the sample, store, train, update pipeline; communicating over the network with other machines, the redis backend, etc.; logging data to files. Those kinds of tasks don’t run on the gpu and never will.

Given your goal, what you can realistically hope to do is set up a configuration where the amount of cpu used for everything other than neural network computations is comparable across your runs. Your dependent variables are then the amount of time, iterations, samples, … required when running the neural network on the cpu versus on the gpu.
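One way to set that up is to keep a shared base config and change only num_gpus between the two runs. This is a sketch; the worker count and the cpu values are illustrative assumptions, not recommendations.

```python
import copy

# Shared settings so that sampling and bookkeeping cost the same in both runs.
# The concrete numbers are assumptions chosen for illustration.
base = {
    "num_workers": 8,
    "num_cpus_for_driver": 1,
    "num_cpus_per_worker": 1,
    "num_gpus_per_worker": 0,  # rollout workers stay on the cpu in both runs
}

cpu_run = copy.deepcopy(base)
cpu_run["num_gpus"] = 0  # learner network on the cpu

gpu_run = copy.deepcopy(base)
gpu_run["num_gpus"] = 1  # learner network on the gpu

# Confirm that the two experiments differ in exactly one setting.
differing = {key for key in gpu_run if cpu_run[key] != gpu_run[key]}
print(differing)
```

Each dict would then be passed as the trainer config for its own run, and the time, iterations or samples needed can be compared between the two.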

Hope this helps,


Hi again @mannyv and thank you so much for the answer. I think that now everything I wanted to know is clear to me and I have understood a bit more how Ray really works. Thanks a lot!