Number of CPUs detected by ray on systems with different CPUs

lebedov · August 3, 2022, 8:52pm

On an Ubuntu 18.04.5 Linux system with an Intel(R) Xeon(R) Platinum 8259CL CPU, Python’s (3.8.8) multiprocessing module detects 64 CPUs. If I initialize Ray (1.13.0) on that system, it detects 2 CPUs. On a Linux system with an AMD EPYC 7R32 CPU, multiprocessing detects 192 CPUs, while Ray (same version) detects 190 CPUs. Any thoughts as to why Ray isn’t seeing all of the CPUs on the system with the Intel CPU?

rickyyx · August 4, 2022, 10:43pm

Hey @lebedov, thanks for flagging this. This definitely doesn’t look right!

When you say Ray detects X CPUs, which Ray functions or commands did you use?

lebedov · August 5, 2022, 2:18am

ray.init() followed by ray.nodes(); I’m alluding to what the latter reports.
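For readers who want to reproduce the check described above, here is a minimal sketch. It assumes each entry returned by ray.nodes() is a dict with "NodeID", "Alive", and "Resources" keys; the helper names cpus_per_node and report_detected_cpus are hypothetical and not part of Ray.

```python
from typing import Any, Dict, List


def cpus_per_node(nodes: List[Dict[str, Any]]) -> Dict[str, float]:
    """Map each alive node's ID to the CPU count listed under its "Resources" entry."""
    return {
        node["NodeID"]: node.get("Resources", {}).get("CPU", 0.0)
        for node in nodes
        if node.get("Alive")
    }


def report_detected_cpus() -> Dict[str, float]:
    """Start a local Ray instance and return the CPU count registered per node."""
    import ray  # imported lazily so cpus_per_node() can be used without Ray installed

    ray.init()
    try:
        return cpus_per_node(ray.nodes())
    finally:
        ray.shutdown()
```

Calling report_detected_cpus() on a single-node setup should return one entry; comparing its value with multiprocessing.cpu_count() shows the kind of mismatch discussed in this thread.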

rickyyx · August 9, 2022, 2:18am

Hey @lebedov, sorry for the delay in replying (I missed this reply).

ray.nodes() looks at the resources currently available for scheduling, taking the states of the nodes into account as well. Depending on the state of the cluster (whether nodes are registered or alive), the number might differ.

Could you confirm that the number of alive nodes is the same and that the cluster is in a similar state in the two environments?

lebedov · August 9, 2022, 4:28am

For each of the two scenarios I mentioned, the cluster consists only of a single node (i.e., the one on which ray.init() is executed), which also is the only node reported by ray.nodes().

I just noticed that if I call ray.init() with num_cpus=X on the 64-core Intel Xeon system, with X set to some number higher than 2, ray.nodes() subsequently reports that number of CPUs for the detected node - even when X is greater than 64.

jjyao · August 10, 2022, 4:57pm

I’m not sure I understand this. By default, Ray auto-detects the number of CPUs on the machine and sets num_cpus (the number of logical CPU resources) to that value, but you can also override it with whatever number you like by explicitly passing num_cpus=X to ray.init(). So you are saying that Ray failed to detect the correct number of physical CPUs in your case. Is my understanding correct?

lebedov · August 12, 2022, 1:53am

Yes.

I mentioned the observation re num_cpus because it wasn’t clear to me whether being able to manually set a number of CPUs greater than what was detected is expected behavior.

jjyao · August 12, 2022, 6:27pm

If you think the number of CPUs Ray detects is wrong, you can just set num_cpus explicitly to the correct number. This is totally fine.

That said, the fact that Ray cannot detect the correct number of CPUs looks like a bug to me. I may not have the exact system needed to reproduce and debug it. The code that detects the number of CPUs is ray._private.utils.get_num_cpus(). If you look at the implementation, it basically uses multiprocessing.cpu_count() unless it’s running inside Docker. Is that the case for you? Also, if you set the environment variable RAY_USE_MULTIPROCESSING_CPU_COUNT=1, does Ray detect the correct number of CPUs?
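One way to investigate a mismatch like this is to compare the different CPU counts visible from inside the container. The sketch below is a diagnostic under stated assumptions, not Ray's implementation: it assumes the container's CPU limit is expressed as a cgroup CFS quota at the standard paths (cpu.max for cgroup v2, cpu/cpu.cfs_quota_us and cpu/cpu.cfs_period_us for cgroup v1). Whether Ray 1.13.0 reads exactly these files is not confirmed in this thread. The function names are hypothetical.

```python
import multiprocessing
import os
from pathlib import Path
from typing import Optional


def _ratio(quota: str, period: str) -> Optional[float]:
    """Convert a quota/period pair to a CPU count; None means no usable limit."""
    if quota == "max":  # cgroup v2 spelling of "unlimited"
        return None
    q, p = int(quota), int(period)
    return q / p if q > 0 and p > 0 else None  # cgroup v1 uses -1 for "unlimited"


def cfs_quota_cpus(root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return the CPU limit implied by the CFS quota, or None if absent or unreadable."""
    base = Path(root)
    try:
        v2 = base / "cpu.max"
        if v2.is_file():
            quota, period = v2.read_text().split()
            return _ratio(quota, period)
        quota = (base / "cpu" / "cpu.cfs_quota_us").read_text().strip()
        period = (base / "cpu" / "cpu.cfs_period_us").read_text().strip()
        return _ratio(quota, period)
    except (OSError, ValueError):
        return None


def cpu_views() -> dict:
    """Collect several views of the CPU count available to this process."""
    affinity = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else None
    return {
        "multiprocessing.cpu_count": multiprocessing.cpu_count(),
        "sched_getaffinity": affinity,
        "cfs_quota_cpus": cfs_quota_cpus(),
    }
```

If cpu_views() shows, for example, 64 for multiprocessing.cpu_count but a much smaller cfs_quota_cpus value, a container CPU limit is one possible explanation for a low detected count.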

lebedov · August 12, 2022, 7:40pm

I am indeed running Ray on Ubuntu in a Docker image. Setting RAY_USE_MULTIPROCESSING_CPU_COUNT=1 does enable it to detect the right number of CPUs (i.e., what I can see in /proc/cpuinfo). Thanks!
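The two workarounds mentioned in this thread can be sketched as follows. The environment variable name and the num_cpus argument come from the posts above; the helper host_cpu_init_kwargs is hypothetical, and the sketch assumes the variable must be set in the process environment before ray.init() is called.

```python
import multiprocessing
import os
from typing import Any, Dict


def host_cpu_init_kwargs(use_env_var: bool = True) -> Dict[str, Any]:
    """Prepare keyword arguments for ray.init() so Ray uses the host CPU count.

    With use_env_var=True, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 and return
    no extra arguments; otherwise return an explicit num_cpus override.
    """
    if use_env_var:
        # Assumption: Ray reads this variable when it starts, so set it first.
        os.environ["RAY_USE_MULTIPROCESSING_CPU_COUNT"] = "1"
        return {}
    return {"num_cpus": multiprocessing.cpu_count()}
```

Usage would look like `ray.init(**host_cpu_init_kwargs())`. Note that either approach can make Ray schedule more concurrent work than a container's CPU limit allows, if such a limit is in place.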

jjyao · August 12, 2022, 7:56pm

@Alex, it seems we detect the number of CPUs inside Docker incorrectly.

Alex · August 12, 2022, 8:01pm

Do we have enough info to repro this? What flags are set in the container?

lebedov · August 21, 2022, 3:13am

Do you mean the runtime flags?

The container is being run through Domino Data Lab on AWS. I’ll have to check with the admins to see what flags are used, as those are not exposed to users.

lebedov · August 29, 2022, 4:42pm

The folks at Domino told me that they don’t have direct control over the flags set in the container, but they said that if I obtain the pod spec used on Amazon EKS, one could use it to replicate the container configuration in which I observed the problematic CPU count detection. I’m looking into how to obtain said spec.
