Usage of CPU resource on RayCluster GCloud

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello, I am encountering the following issue. After launching a cluster with ray up cluster_config.yaml and running my script via ray job submit, I noticed that the resources used by each worker amount to 100% of a single CPU per worker running on the VMs. However, I have configured the necessary resources in the decorator @ray.remote(num_cpus=16, resources={"worker":1}), and I have taken care to add the following arguments:

image: "rayproject/ray-ml:latest-cpu"
container_name: "ray_container"
pull_before_run: True
run_options:
  - --ulimit nofile=65536:65536
  - --cpuset-cpus="0-15"

and

available_node_types:
  ray_head_default:
    resources: {"CPU": 16, "memory": 68719476736,"head":1}

in my configuration file. The execution is being tested on n2-standard-16 instances.
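For reference, a quick way to confirm what the cluster actually registers is to query the resource view from the driver (a minimal sketch; connecting with address="auto" is an assumption about how the job reaches the cluster):

import ray

# Connect to the running cluster started by `ray up`.
ray.init(address="auto")

# Should report something like {'CPU': 16, 'memory': ..., 'head': 1, 'worker': ...},
# aggregated over all nodes.
print(ray.cluster_resources())
print(ray.available_resources())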

I should mention that locally, Ray uses all the available CPU resources on my machine. Moreover, I observe the following behavior: when I run nproc in a shell joined via ray attach, the response is 1. When I run nproc directly on the Docker image, I get the value 16, which is correct. I have conducted multiple tests, and the CPU usage systematically remains stuck at one processor unit, which is about 6 to 7% in my case. When I run htop, I can see that there is always only one processor at 100% while the others are idle.
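One way to see what the worker processes themselves are allowed to use (a sketch assuming a Linux container; the task name is illustrative) is to compare the visible CPU count with the scheduler affinity from inside a remote task:

import os

import ray

ray.init(address="auto")

@ray.remote
def cpu_view():
    # os.cpu_count() reports the CPUs visible to the container (what nproc prints),
    # while sched_getaffinity reports the CPUs this worker process may actually run on.
    return os.cpu_count(), len(os.sched_getaffinity(0))

print(ray.get(cpu_view.remote()))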

Can you share your application code? This smells like something at that layer (e.g., if you forgot to decorate something with @ray.remote).

Thank you, Sam.
I have looked in this direction, but I have not forgotten to decorate my remote classes:

import os
import ray

# Request a GPU only when the CUDA backend is selected; always request all CPUs.
if os.getenv("RAY_BACKEND") == "cuda":
    num_gpus = 1
    num_cpus = os.cpu_count()
else:
    num_gpus = 0
    num_cpus = os.cpu_count()

@ray.remote(num_cpus=num_cpus, num_gpus=num_gpus)
class MyWorker:  # placeholder for the actual remote class
    ...

The fact that you are decorating your task with @ray.remote(num_cpus=16, ...) does not make it use all available CPUs automatically. If the task is running single-threaded code, it will be confined to a single CPU. If the code is written to be multi-threaded, could there be some environmental factor changing the way the code runs in your cluster?
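To make the point concrete, here is a minimal sketch (not your application code, and the function names are made up): reserving 16 CPUs only tells the scheduler how much to set aside; the task body still has to do the parallel work itself, e.g. via a library such as torch that runs multi-threaded kernels.

import ray
import torch

@ray.remote(num_cpus=16)
def single_threaded_sum(n):
    # Plain Python: runs on one core regardless of the num_cpus reservation.
    return sum(range(n))

@ray.remote(num_cpus=16)
def multi_threaded_matmul(size):
    # A library that parallelizes internally can use the reserved CPUs,
    # provided its thread count is not pinned to 1 by the environment.
    torch.set_num_threads(16)
    a = torch.randn(size, size)
    return torch.mm(a, a).sum().item()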

What are you running your Ray cluster on? You mentioned VMs, but the config snippet looks like a docker-compose config. If you have all workers running on the same server, they will compete for the same CPUs.

When I run nproc directly on the Docker image, I get the value 16, which is correct.

Assuming I'm correct about docker-compose and a single server, what does nproc show if you run it in the running worker container via 'docker exec'?

Thank you for your responses. I finally understood why: it was the call to torch.set_num_threads() that was causing the problem. In my development environment, if I choose torch.set_num_threads(num_cpus), it generates oversubscription and significantly slows down my code, so I had left the default value. On the nodes of my VMs, it is necessary to call torch.set_num_threads(num_cpus).
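For anyone hitting the same thing, here is a sketch of what the fix looks like in my case (the class name is illustrative, and the hard-coded 16 simply mirrors the decorator): set torch's thread count inside the remote class to match the CPUs reserved for it. As far as I understand, Ray limits OMP_NUM_THREADS for its worker processes by default, which is likely why torch fell back to a single thread on the cluster even though nproc in the container reports 16.

import ray
import torch

@ray.remote(num_cpus=16, resources={"worker": 1})
class Trainer:  # illustrative name
    def __init__(self):
        # Match torch's intra-op thread pool to the CPUs reserved for this actor.
        torch.set_num_threads(16)

    def step(self, size=2048):
        a = torch.randn(size, size)
        return torch.mm(a, a).sum().item()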
