Let’s say I have nodes, each with K CPUs and M GB of RAM.
And I have actors that each require m GB of RAM, such that K*m > M.
That is, if I run one actor per CPU, the node will start swapping, which I want to avoid. Is there a way to instruct Ray to respect that, and not assign more actors to a node than the memory allows?
Things I tried:
- Marking an actor as requiring 1.2 CPUs. Result: Ray does not allow fractional CPU requirements greater than 1.
- Marking an actor as requiring a custom resource such as `{"r": 1.0}` and specifying in the cluster configuration file that the machine type provides some amount of this resource (say, 9.0). This should allow up to 9 such actors to run on the node. Result: I am not sure why, but Ray said that no machine can satisfy the requirements.
I am using Google Cloud, if it matters.
(On a related question: Ray seems to know the number of CPUs provided by the node type. How does it know that? Or does it have to create the nodes first?)
> marking an actor as requiring a resource such as `{"r": 1.0}` and specifying in the configuration file that the machine type has some number of this resource (say, 9.0). this should allow up to 9 actors to run on this node. result: i am not sure why, but ray said that no machine can satisfy the requirements.
Hmm, this sounds a bit weird. Can you give me a short script I can try? Google Cloud shouldn’t be related to this.
> on a related question, ray seems to know the number of cpus provided by the node type
When Ray is started without `--num-cpus` (e.g., `ray start --address=$HEAD_NODE_ADDR`), it automatically detects the number of CPUs on the machine.
Lastly, note that by default an actor requires 0 CPUs. E.g.,
```python
@ray.remote
class A:
    pass

# is equal to
@ray.remote(num_cpus=0)
class A:
    pass
```
Here is some test code which I expect to spawn two worker machines, and it doesn’t:
```python
import ray
import time
from ray.util import ActorPool

# My cluster will have 5 "r" resources per node, so I expect
# the 10 actors below to require 2 nodes.
ray.init(address='auto')

@ray.remote(num_cpus=1.0, resources={"r": 1})
class MyActor:
    def __init__(self):
        pass

    def work(self):
        while True:
            time.sleep(2)

actors = [MyActor.remote() for _ in range(10)]
pool = ActorPool(actors)
for x in pool.map_unordered(lambda a, v: a.work.remote(), list(range(10))):
    pass
```
Here is the relevant part of the config; the rest should be pretty much as in the example (using the Ray Docker image):
Looks like your node doesn’t contain the custom resource `r` (based on your monitor output), which is weird given you passed `r: 5`. cc @Ameer_Haj_Ali @Alex, do you guys know what’s the issue here?
NOTE: You can also see the comprehensive monitor status using the `ray status` command.
TL;DR: You cannot use `resources` in `worker_nodes`, because the only fields that can go under `worker_nodes` are the cloud provider configurations; instead you should use `available_node_types`.
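For anyone hitting this later, a minimal sketch of the relevant part of an `available_node_types` cluster config (the node type names, machine types, and resource amounts here are placeholders, not taken from the thread):

```yaml
available_node_types:
  head_node:
    resources: {"CPU": 8}
    node_config:
      machineType: n1-standard-8   # placeholder GCP machine type
  worker_node:
    min_workers: 0
    max_workers: 2
    # Declaring resources here is what makes the custom "r" resource
    # visible to the scheduler on nodes of this type.
    resources: {"CPU": 8, "r": 5}
    node_config:
      machineType: n1-standard-8   # placeholder GCP machine type
head_node_type: head_node
```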
Actually, @rliaw 's solution worked! adding CPU: num under resources in the worker-config solved it.
I will try also the available_node_types solution, though I’ll admit it is hard for me to follow the syntax via the references there. (would be great if there could be a gcp example file for the multiple-node-types option)