Usage of CPU resource on RayCluster GCloud

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello, I am encountering the following issue. After launching a cluster with ray up cluster_config.yaml and running my script via ray job submit, I noticed that the resources used by each worker amount to 100% of a single CPU per worker running on the VMs. However, I have configured the necessary resources in the decorator @ray.remote(num_cpus=16, resources={"worker":1}), and I have taken care to add the following arguments:

image: "rayproject/ray-ml:latest-cpu"
container_name: "ray_container"
pull_before_run: True
run_options:
  - --ulimit nofile=65536:65536
  - --cpuset-cpus="0-15"

and

available_node_types:
  ray_head_default:
    resources: {"CPU": 16, "memory": 68719476736,"head":1}

in my configuration file. The execution is being tested on n2-standard-16 instances.
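For reference, a quick way to confirm what the cluster actually registers is to query the resource view from the driver (a minimal sketch; connecting with address="auto" is an assumption about how the job reaches the cluster):

import ray

# Connect to the running cluster started by `ray up`.
ray.init(address="auto")

# Should report something like {'CPU': 16, 'memory': ..., 'head': 1, 'worker': ...},
# aggregated over all nodes.
print(ray.cluster_resources())
print(ray.available_resources())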

I should mention that locally, Ray uses all the available CPU resources on my machine. Moreover, I observe the following behavior: when I run nproc in a shell joined via ray attach, the response is 1. When I run nproc directly on the Docker image, I get the value 16, which is correct. I have conducted multiple tests, and the CPU usage systematically remains stuck at one processor unit, which is about 6 to 7% in my case. When I run htop, I can see that there is always only one processor at 100% while the others are idle.
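One way to see what the worker processes themselves are allowed to use (a sketch assuming a Linux container; the task name is illustrative) is to compare the visible CPU count with the scheduler affinity from inside a remote task:

import os

import ray

ray.init(address="auto")

@ray.remote
def cpu_view():
    # os.cpu_count() reports the CPUs visible to the container (what nproc prints),
    # while sched_getaffinity reports the CPUs this worker process may actually run on.
    return os.cpu_count(), len(os.sched_getaffinity(0))

print(ray.get(cpu_view.remote()))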

Can you share your application code? This smells like something at that layer (e.g., if you forgot to decorate something with @ray.remote).

Thank you, Sam.
I have looked in this direction, but I have not forgotten to decorate my remote classes:

import os
import ray

# Request a GPU only when the CUDA backend is selected; always request all CPUs.
if os.getenv("RAY_BACKEND") == "cuda":
    num_gpus = 1
    num_cpus = os.cpu_count()
else:
    num_gpus = 0
    num_cpus = os.cpu_count()

@ray.remote(num_cpus=num_cpus, num_gpus=num_gpus)
class MyWorker:  # placeholder for the actual remote class
    ...

The fact that you are decorating your task with @ray.remote(num_cpus=16, ...) does not make it use all available CPUs automatically. If the task is running single-threaded code, it will be confined to a single CPU. If the code is written to be multi-threaded, could there be some environmental factor changing the way the code runs in your cluster?
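To make the point concrete, here is a minimal sketch (not your application code, and the function names are made up): reserving 16 CPUs only tells the scheduler how much to set aside; the task body still has to do the parallel work itself, e.g. via a library such as torch that runs multi-threaded kernels.

import ray
import torch

@ray.remote(num_cpus=16)
def single_threaded_sum(n):
    # Plain Python: runs on one core regardless of the num_cpus reservation.
    return sum(range(n))

@ray.remote(num_cpus=16)
def multi_threaded_matmul(size):
    # A library that parallelizes internally can use the reserved CPUs,
    # provided its thread count is not pinned to 1 by the environment.
    torch.set_num_threads(16)
    a = torch.randn(size, size)
    return torch.mm(a, a).sum().item()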

What are you running your Ray cluster on? You mentioned VMs, but the config snippet looks like a docker-compose config. If you have all workers running on the same server, they will compete for the same CPUs.

When I run nproc directly on the Docker image, I get the value 16, which is correct.

Assuming I'm correct about docker-compose and a single server, what does nproc show if you run it in the running worker container via 'docker exec'?

Thank you for your responses. I finally understood why: it was the call to torch.set_num_threads() that was causing the problem. In my development environment, if I choose torch.set_num_threads(num_cpus), it generates oversubscription and significantly slows down my code, so I had left the default value. On the nodes of my VMs, it is necessary to call torch.set_num_threads(num_cpus).
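For anyone hitting the same thing, here is a sketch of what the fix looks like in my case (the class name is illustrative, and the hard-coded 16 simply mirrors the decorator): set torch's thread count inside the remote class to match the CPUs reserved for it. As far as I understand, Ray limits OMP_NUM_THREADS for its worker processes by default, which is likely why torch fell back to a single thread on the cluster even though nproc in the container reports 16.

import ray
import torch

@ray.remote(num_cpus=16, resources={"worker": 1})
class Trainer:  # illustrative name
    def __init__(self):
        # Match torch's intra-op thread pool to the CPUs reserved for this actor.
        torch.set_num_threads(16)

    def step(self, size=2048):
        a = torch.randn(size, size)
        return torch.mm(a, a).sum().item()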
