Restricting number of actors on a given node

Let’s say I have nodes, each with K CPUs and M GB of RAM.
And I have actors that each require m GB of RAM, such that K*m > M.
That is, if I run one actor per CPU, the node will start swapping, and I don’t want that. Is there a way to instruct Ray to respect this, and not assign more actors to a node than the memory allows?

Things I tried:

  • marking an actor as requiring 1.2 CPUs. Result: Ray does not allow fractional CPU requests greater than 1.
  • marking an actor as requiring a custom resource such as {"r": 1.0} and specifying in the configuration file that the machine type has some amount of this resource (say, 9.0). This should allow up to 9 actors to run on the node. Result: I am not sure why, but Ray said that no machine can satisfy the requirements.

I am using Google Cloud, if it matters.

(On a related question: Ray seems to know the number of CPUs provided by the node type. How does it know that? Or does it have to create the nodes first?)

There’s a way to make your actor respect memory restrictions: Memory Management — Ray v1.1.0

  • marking an actor as requiring a custom resource such as {"r": 1.0} and specifying in the configuration file that the machine type has some amount of this resource (say, 9.0). This should allow up to 9 actors to run on the node. Result: I am not sure why, but Ray said that no machine can satisfy the requirements.

Hmm, this sounds a bit weird. Can you give me a short script I can try? Google Cloud shouldn’t be related to this.

on a related question, ray seems to know the number of cpus provided by the node type

When Ray is started without --num-cpus (e.g., ray start --address=$HEAD_NODE_ADDR), it automatically detects the number of CPUs from the machine.

Lastly, note that by default an actor requires 0 CPUs. E.g.,

@ray.remote
class A:
    pass

# is equal to

@ray.remote(num_cpus=0)
class A:
    pass

Here is some test code which I expect to spawn two worker machines, and it doesn’t:

import ray
import time
from ray.util import ActorPool

# my cluster will have 5 "r" resources per node, so I expect it to allocate 2 nodes.

ray.init(address='auto')

@ray.remote(num_cpus=1.0, resources={"r": 1})
class MyActor:
    def __init__(self):
        pass

    def work(self):
        while True:
            time.sleep(2)


actors = [MyActor.remote() for _ in range(10)]
pool = ActorPool(actors)
for x in pool.map_unordered(lambda a, v: a.work.remote(), list(range(10))):
    pass

Here is the relevant part from the config, the rest should be pretty much as in the example (using the ray docker image):

worker_nodes:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-64
    resources:
      r: 5
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 40
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu

this is what i see via ray monitor:

======== Autoscaler status: 2021-02-20 19:04:53.559852 ========
Node status

Healthy:
1 ray-legacy-head-node-type
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.00/1.465 GiB object_store_memory
0.0/2.0 CPU
0.00/4.346 GiB memory

Demands:
{'r': 1.0, 'CPU': 1.0}: 10+ pending tasks/actors
2021-02-20 19:04:53,571 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:58,749 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:58,905 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,065 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+unmanaged))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,198 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,354 WARNING resource_demand_scheduler.py:642 – The autoscaler could not find a node type to satisfy the request: [{'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
2021-02-20 19:04:59,370 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,529 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,658 INFO autoscaler.py:305 –

Looks like your node doesn’t contain the custom resource r (based on your monitor output), which is weird given you passed r: 5. cc @Ameer_Haj_Ali @Alex do you guys know what’s the issue here?

NOTE: You can also see the comprehensive monitor status using the ray status command.


Maybe it’s some weird default override. As a guess, maybe you could try doing:

resources:
    r: 5
    CPU: 5

Hi @Yoav, thanks for filing this issue.
I would like to refer you to our cluster YAML configuration reference:
https://docs.ray.io/en/master/cluster/config.html#cluster-configuration-worker-nodes
The object you put under worker_nodes is of type Node Config (Cluster YAML Configuration Options — Ray v2.0.0.dev0), which translates to the configuration options we pass to the cloud provider (e.g., EC2). What you are looking for is available_node_types with the resources field:
Cluster YAML Configuration Options — Ray v2.0.0.dev0

TL;DR: You cannot use resources under worker_nodes, because the only fields that can go there are cloud provider configurations; instead, you should use available_node_types.
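For example, a hedged sketch of what that could look like for your config — the node-type name, worker counts, and resource numbers below are placeholders; the structure follows the multi-node-type docs linked above, with your GCP fields moved under node_config:

```yaml
available_node_types:
    worker_node:
        min_workers: 0
        max_workers: 10
        # Resources this node type advertises to the scheduler.
        resources: {"CPU": 64, "r": 5}
        # Cloud-provider fields (what used to live under worker_nodes).
        node_config:
            machineType: n1-standard-64
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 40
                  sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
```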


Thanks @rliaw and @Ameer_Haj_Ali!

Actually, @rliaw 's solution worked! Adding CPU: num under resources in the worker config solved it.

I will also try the available_node_types solution, though I’ll admit the syntax is hard for me to follow from the references there. (It would be great to have a GCP example file for the multiple-node-types option.)
