Number of CPUs detected by ray on systems with different CPUs

On a Linux Ubuntu 18.04.5 system with an Intel(R) Xeon(R) Platinum 8259CL CPU, python’s (3.8.8) multiprocessing module detects 64 CPUs. If I try to initialize ray (1.13.0) on said system, it detects 2 CPUs. On a Linux system with an AMD EPYC 7R32 CPU, multiprocessing detects 192 cores, while ray (same version) detects 190 CPUs. Any thoughts as to why ray isn’t seeing all of the cores on the system with the Intel CPU?

Hey @lebedov, thanks for flagging this. It doesn’t look right, for sure!

When you say ray detects X CPUs, which ray functions/commands did you use?

ray.init() followed by ray.nodes(); I’m alluding to what the latter reports.
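For anyone wanting to reproduce the comparison, here is a minimal sketch of the check described above. The helper names `total_cpus` and `report_cpu_counts` are made up for illustration, and the sketch assumes each entry returned by ray.nodes() is a dict with an "Alive" flag and a "Resources" mapping that contains a "CPU" key.

```python
import multiprocessing


def total_cpus(nodes):
    """Sum the "CPU" resource over alive nodes in a ray.nodes()-style list.

    Assumes each node is a dict with "Alive" and "Resources" keys.
    """
    return sum(
        node.get("Resources", {}).get("CPU", 0)
        for node in nodes
        if node.get("Alive")
    )


def report_cpu_counts():
    """Print the CPU count seen by multiprocessing and by a local Ray node."""
    import ray  # imported lazily so total_cpus() works without Ray installed

    ray.init()
    try:
        print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
        print("CPUs reported by ray.nodes():", total_cpus(ray.nodes()))
    finally:
        ray.shutdown()
```

Calling `report_cpu_counts()` on each machine should show whether the two numbers disagree.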


Hey @lebedov , sorry for the delay in replying (missed this reply)

ray.nodes() looks at the resources currently available for scheduling, taking the state of the nodes into account as well. Depending on the state of the cluster (whether nodes are registered or alive), the number might appear different.

Could you confirm that the number of alive nodes is the same and that the cluster is in a similar state in the two environments?

For each of the two scenarios I mentioned, the cluster consists only of a single node (i.e., the one on which ray.init() is executed), which also is the only node reported by ray.nodes().

I just noticed that if I call ray.init() with num_cpus=X on the 64-core Intel Xeon system, with X set to some number higher than 2, ray.nodes() subsequently reports that number of CPUs in the detected node - even when X is greater than 64.

I’m not sure I understand this. By default, Ray auto-detects the number of CPUs on the machine and sets num_cpus (the number of logical CPU resources) to that value, but you can also override it with whatever number you like by explicitly passing num_cpus=X to ray.init(). So you are saying Ray failed to detect the correct number of physical CPUs in your case; is my understanding correct?

Yes.

I mentioned the observation re num_cpus because it wasn’t clear to me whether being able to manually set a number of CPUs greater than what was detected is expected behavior.


If you think the number of CPUs Ray detects is wrong, you can just set num_cpus explicitly to the correct number; this is totally fine.

That said, the fact that Ray cannot detect the correct number of CPUs looks like a bug to me, though I may not have the exact system to reproduce and debug it. The code that detects the number of CPUs is ray._private.utils.get_num_cpus(). If you look at the implementation, it basically uses multiprocessing.cpu_count() unless it’s running inside a Docker container. Is that the case for you? Also, if you set the env var RAY_USE_MULTIPROCESSING_CPU_COUNT=1, does Ray detect the correct number of CPUs?
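To illustrate why a container-aware count can differ from multiprocessing.cpu_count(), here is a rough sketch of how a CPU limit might be derived from cgroup v1 CPU quota files. This is not Ray's actual code: the function names are hypothetical, the default path assumes a cgroup v1 layout mounted at /sys/fs/cgroup/cpu, and cgroup v2 and cpuset limits are ignored. Under these assumptions, a quota of 200000 with a period of 100000 would yield 2 CPUs regardless of how many cores the host has, which is one plausible way a 64-core machine could be reported as having 2.

```python
import math
import multiprocessing
from pathlib import Path


def cpus_from_quota(quota_us, period_us):
    """Convert a CFS quota/period pair into a CPU count, or None if unlimited."""
    if quota_us <= 0 or period_us <= 0:
        return None  # a quota of -1 means no limit is set
    return quota_us / period_us


def read_cgroup_v1_cpu_limit(cgroup_dir="/sys/fs/cgroup/cpu"):
    """Read the CPU limit from cgroup v1 files; None if missing or unlimited.

    The default directory is an assumption about the container's layout.
    """
    try:
        quota = int(Path(cgroup_dir, "cpu.cfs_quota_us").read_text())
        period = int(Path(cgroup_dir, "cpu.cfs_period_us").read_text())
    except (OSError, ValueError):
        return None
    return cpus_from_quota(quota, period)


def effective_cpu_count():
    """Host CPU count, capped by the container's CPU quota if one is set."""
    host = multiprocessing.cpu_count()
    limit = read_cgroup_v1_cpu_limit()
    if limit is None:
        return host
    return min(host, max(1, math.ceil(limit)))
```

Comparing `effective_cpu_count()` with `multiprocessing.cpu_count()` inside the container may help show whether a CPU quota is in effect.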

I am indeed running ray on Ubuntu in a Docker container. Setting RAY_USE_MULTIPROCESSING_CPU_COUNT=1 does enable it to detect the right number of CPUs (i.e., what I can see in /proc/cpuinfo). Thanks!

@Alex it seems we detect the number of CPUs incorrectly inside Docker.

Do we have enough info to repro here? What flags are set in the container?
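For reference, a minimal sketch of applying that workaround from Python rather than the shell. It assumes the variable is read in the process that calls ray.init(), so it is set before Ray starts; `start_ray_with_host_cpu_count` is a made-up helper name.

```python
import os

# Set the variable before Ray is initialized so that CPU detection
# falls back to multiprocessing.cpu_count() (assumption: Ray reads
# the variable in this process at init time).
os.environ["RAY_USE_MULTIPROCESSING_CPU_COUNT"] = "1"


def start_ray_with_host_cpu_count():
    """Start a local Ray node and return the node list for inspection."""
    import ray  # imported lazily; requires Ray to be installed

    ray.init()
    return ray.nodes()
```

Exporting the variable in the shell before launching Python should have the same effect. Note that this makes Ray use the host's core count even when the container is restricted to fewer CPUs.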

Do you mean the runtime flags?

The container is being run through Domino Data Lab on AWS. I’ll have to check with the admins to see what flags are used, as those are not exposed to users.

The folks at Domino told me that they don’t have direct control over the flags set in the container, but they said that if I obtain the pod spec used on Amazon EKS, one could use it to replicate the container configuration in which I observed the problematic CPU count detection. I’m looking into how to obtain said spec.
