I have a CPU-bound pandas job that runs slightly faster with @ray.remote(num_cpus=2) than with @ray.remote(num_cpus=1).
From the docs, num_cpus is the quantity of CPU resources to reserve for a given task. My understanding is that with num_cpus=2, each worker node should only be utilizing half of its available CPU cores, and yet the whole job still runs faster.
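Roughly what the job looks like (simplified; process_chunk and the groupby are just placeholders for my actual pandas work):

```python
import pandas as pd
import ray

ray.init()

# Placeholder for my real CPU-bound pandas transform.
@ray.remote(num_cpus=2)  # runs slightly faster than num_cpus=1 in my tests
def process_chunk(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("key", as_index=False).sum()

chunks = [
    pd.DataFrame({"key": range(10_000), "val": range(10_000)})
    for _ in range(16)
]
results = ray.get([process_chunk.remote(c) for c in chunks])
```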
Is there any way to limit the Ray cluster's CPU utilization to 90% of the CPU count? I've tried setting num_cpus to a fractional value like num_cpus=1.5, but fractional resource requirements above 1 are not supported.
Ray does not do any affinity handling of CPUs. Ray resources are logical, not physical (with the exception of GPUs). When you assign "CPUs", you are simply preventing other tasks from taking that portion of the logical "CPUs". But Ray does nothing to prevent a task that reserved 1 "CPU" from using 10 physical cores underneath (e.g. n_jobs=10 in scikit-learn or another library). So you can very easily oversubscribe the physical cores by accident if you don't pay attention to what each task is actually doing.
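As a sketch of how that oversubscription can happen (the n_jobs=10 value and the random-forest workload here are just illustrative):

```python
import ray
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

ray.init(num_cpus=8)  # 8 *logical* CPUs known to the Ray scheduler

@ray.remote(num_cpus=1)  # reserves 1 logical CPU slot from the scheduler...
def train():
    X, y = make_classification(n_samples=5_000, n_features=20)
    # ...but n_jobs=10 still fans out to 10 workers underneath,
    # so the physical cores get oversubscribed.
    clf = RandomForestClassifier(n_estimators=100, n_jobs=10)
    clf.fit(X, y)
    return clf.score(X, y)

# 8 concurrent "1 CPU" tasks can now try to use ~80 physical cores.
print(ray.get([train.remote() for _ in range(8)]))
```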
Please have a look at the doc where this is described in more detail:
The fact that resources are logical has several implications:
Resource requirements of tasks or actors do NOT impose limits on actual physical resource usage. For example, Ray doesn’t prevent a num_cpus=1 task from launching multiple threads and using multiple physical CPUs. It’s your responsibility to make sure tasks or actors use no more resources than specified via resource requirements.
Ray doesn’t provide CPU isolation for tasks or actors. For example, Ray won’t reserve a physical CPU exclusively and pin a num_cpus=1 task to it. Ray will let the operating system schedule and run the task instead. If needed, you can use operating system APIs like sched_setaffinity to pin a task to a physical CPU.
Ray does provide GPU isolation in the form of visible devices by automatically setting the CUDA_VISIBLE_DEVICES environment variable, which most ML frameworks will respect for purposes of GPU assignment.
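If you really do want pinning on Linux, here is a minimal sketch of the sched_setaffinity approach mentioned above (the CPU ID is arbitrary; this is not something Ray does for you):

```python
import os
import ray

ray.init()

@ray.remote(num_cpus=1)
def pinned_task(cpu_id: int):
    # Linux-only: pin this worker process to one physical CPU.
    # The first argument, 0, means "the current process".
    os.sched_setaffinity(0, {cpu_id})
    # ... run the CPU-bound work here ...
    return os.sched_getaffinity(0)

print(ray.get(pinned_task.remote(3)))  # e.g. {3}
```

Keep in mind that Ray reuses worker processes, so any affinity you set this way will stick around for later tasks scheduled on that worker unless you reset it.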