Restricting number of actors on a given node

Let’s say I have nodes, each with K CPUs and M GB of RAM.
And I have actors that each require m GB of RAM, such that K*m > M.
That is, if I run one actor per CPU, the node will start swapping, and I don’t want that. Is there a way to instruct Ray to respect this, and not assign more actors to a node than the memory allows?

Things I tried:

  • marking an actor as requiring 1.2 CPUs. Result: Ray does not allow fractional CPU requests greater than 1.
  • marking an actor as requiring a custom resource such as {"r": 1.0} and specifying in the configuration file that the machine type has some amount of this resource (say, 9.0). This should allow up to 9 actors to run on the node. Result: I am not sure why, but Ray said that no machine can satisfy the requirements.

I am using Google Cloud, if it matters.

(On a related question: Ray seems to know the number of CPUs provided by the node type. How does it know that? Or does it have to create the nodes first?)

There’s a way to make your actor respect memory restrictions: Memory Management — Ray v1.1.0

  • marking an actor as requiring a custom resource such as {"r": 1.0} and specifying in the configuration file that the machine type has some amount of this resource (say, 9.0). This should allow up to 9 actors to run on the node. Result: I am not sure why, but Ray said that no machine can satisfy the requirements.

Hmm, this sounds a bit weird. Can you give me a short script I can try? Google Cloud shouldn’t be related to this.

on a related question, ray seems to know the number of cpus provided by the node type

When Ray is started without --num-cpus (e.g., ray start --address=$HEAD_NODE_ADDR), it automatically detects the number of CPUs from the machine.

Lastly, note that by default an actor requires 0 CPUs. E.g.,

@ray.remote
class A:
    pass

# is equal to

@ray.remote(num_cpus=0)
class A:
    pass

Here is some test code which I expect to spawn two worker machines, and it doesn’t:

import ray
import time
from ray.util import ActorPool

# my cluster will have 5 "r" resources per node, so I expect it to allocate 2 nodes.

ray.init(address='auto')

@ray.remote(num_cpus=1.0, resources={"r": 1})
class MyActor:
    def __init__(self):
        pass

    def work(self):
        while True:
            time.sleep(2)


actors = [MyActor.remote() for _ in range(10)]
pool = ActorPool(actors)
for x in pool.map_unordered(lambda a, v: a.work.remote(), list(range(10))):
    pass

Here is the relevant part from the config, the rest should be pretty much as in the example (using the ray docker image):

worker_nodes:
    tags:
      - items: ["allow-all"]
    machineType: n1-standard-64
    resources:
      r: 5
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 40
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu

this is what i see via ray monitor:

======== Autoscaler status: 2021-02-20 19:04:53.559852 ========
Node status

Healthy:
1 ray-legacy-head-node-type
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.00/1.465 GiB object_store_memory
0.0/2.0 CPU
0.00/4.346 GiB memory

Demands:
{'r': 1.0, 'CPU': 1.0}: 10+ pending tasks/actors
2021-02-20 19:04:53,571 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:58,749 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:58,905 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,065 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+unmanaged))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,198 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,354 WARNING resource_demand_scheduler.py:642 – The autoscaler could not find a node type to satisfy the request: [{'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}, {'r': 1.0, 'CPU': 1.0}]. If this request is related to placement groups the resource request will resolve itself, otherwise please specify a node type with the necessary resource https://docs.ray.io/en/master/cluster/autoscaling.html#multiple-node-type-autoscaling.
2021-02-20 19:04:59,370 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((labels.ray-node-type+%3D+worker))+AND+((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,529 INFO discovery.py:873 – URL being requested: GET https://compute.googleapis.com/compute/v1/projects/ai2-israel/zones/us-west1-a/instances?filter=((status+%3D+STAGING)+OR+(status+%3D+PROVISIONING)+OR+(status+%3D+RUNNING))+AND+(labels.ray-cluster-name+%3D+test1)&alt=json
2021-02-20 19:04:59,658 INFO autoscaler.py:305 –

Looks like your node doesn’t contain the custom resource r (based on your monitor output), which is weird given you passed r: 5. cc @Ameer_Haj_Ali @Alex do you guys know what’s the issue here?

NOTE: You can also see the comprehensive monitor status using the ray status command.


Maybe it’s some weird default override. As a guess, maybe you could try doing:

resources:
    r: 5
    CPU: 5

Hi @Yoav, thanks for filing this issue.
I would like to refer you to our cluster YAML configuration reference:
https://docs.ray.io/en/master/cluster/config.html#cluster-configuration-worker-nodes
The object you put under worker_nodes is of type Node Config (Cluster YAML Configuration Options — Ray v2.0.0.dev0), which translates to the configuration options we pass to the cloud provider (e.g., EC2). What you are looking for is available_node_types with the resources field:
Cluster YAML Configuration Options — Ray v2.0.0.dev0

TL;DR: You cannot use resources under worker_nodes, because the only fields that can go there are cloud provider configurations; instead, you should use available_node_types.
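For example, a hedged sketch of what that could look like for your config — the node-type name, worker counts, and resource numbers below are placeholders; the structure follows the multi-node-type docs linked above, with your GCP fields moved under node_config:

```yaml
available_node_types:
    worker_node:
        min_workers: 0
        max_workers: 10
        # Resources this node type advertises to the scheduler.
        resources: {"CPU": 64, "r": 5}
        # Cloud-provider fields (what used to live under worker_nodes).
        node_config:
            machineType: n1-standard-64
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 40
                  sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
```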


Thanks @rliaw and @Ameer_Haj_Ali!

Actually, @rliaw 's solution worked! Adding CPU: num under resources in the worker config solved it.

I will also try the available_node_types solution, though I’ll admit the syntax is hard for me to follow from the references there. (It would be great to have a GCP example file for the multiple-node-types option.)
