Hi, it doesn’t matter what I pass as the head node’s max_workers: Ray always uses all the CPUs available on the VM.
Did I misunderstand this attribute, or is there a bug?
Reproduce with the following cluster config and script:
cluster_name: matan
max_workers: 10

provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-a
    project_id: ai2-israel

auth:
    ssh_user: ray

available_node_types:
    head_node:
        min_workers: 0
        max_workers: 1
        resources: {"CPU": 4}
        node_config:
            machineType: n1-highmem-4
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 50
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
    worker_node:
        min_workers: 0
        resources: {"CPU": 2}
        node_config:
            machineType: n1-highmem-2
            tags:
              - items: ["allow-all"]
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                    diskSizeGb: 50
                    sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu-debian-9
            scheduling:
              - preemptible: false

head_node_type: head_node

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
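
As a side check (not part of the repro itself), the resources Ray actually registered can be printed from any driver attached to the cluster; ray.cluster_resources() and ray.available_resources() are standard Ray APIs:

import ray

ray.init(address="auto")
# Total resources the cluster advertises, keyed by name ("CPU", "memory", ...).
print(ray.cluster_resources())
# Resources currently unclaimed; this drops as actors get scheduled.
print(ray.available_resources())

If the value from resources: {"CPU": ...} on the head node were honored, it should be reflected here. The script that triggers the behavior: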
import logging
from time import sleep

import ray
import ray.autoscaler.sdk

logger = logging.getLogger(__file__)


@ray.remote(num_cpus=1.0, max_restarts=-1, max_task_retries=-1)
class Reproduce(object):
    def run(self):
        # Sleep ~11 seconds so many actors are alive at the same time.
        for i in range(10, -1, -1):
            sleep(1)
        return 4


ray.init(address="auto")

arr = []
for _ in range(40):
    c = Reproduce.remote()
    arr.append(c.run.remote())
print(ray.get(arr))
In this example, the head node runs the “run” method on 4 CPUs at once (4 concurrent actors) instead of only on a single one.
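
To make the placement visible, here is a hypothetical variant of the actor (the names are mine; socket.gethostname() just reports which node each call ran on):

import socket
from time import sleep

import ray


@ray.remote(num_cpus=1.0)
class WhereAmI(object):
    def run(self):
        sleep(1)
        # Report the node this actor was scheduled on.
        return socket.gethostname()


ray.init(address="auto")
actors = [WhereAmI.remote() for _ in range(40)]
print(ray.get([a.run.remote() for a in actors]))

If the goal were only to keep work off the head node, my understanding is that setting resources: {"CPU": 0} on the head node type (or passing --num-cpus=0 to ray start --head) is the usual way to do that, but that still wouldn’t explain why the CPU count I pass is ignored.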