Where to instantiate named actors for global coordination

Hi there,

I am currently implementing a variant of DQN with multiple workers, in which a global matrix is continuously updated in place. This requires global coordination, so I am using a named actor that holds the matrix.
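For context, the actor holding the matrix looks roughly like this (the class name, matrix shape and method names are simplified placeholders):

import numpy as np
import ray

@ray.remote
class GlobalMatrixActor:
    """Holds the shared matrix that all workers read and update in place."""

    def __init__(self, shape=(100, 100)):
        self.matrix = np.zeros(shape)

    def update(self, indices, values):
        # In-place update coming from a worker.
        self.matrix[indices] = values

    def get(self):
        return self.matrix.copy()

# Created once under a name so that workers can look it up later:
# actor = GlobalMatrixActor.options(name="global_matrix").remote()
# ...and retrieved from anywhere in the cluster via:
# actor = ray.get_actor("global_matrix")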

However, I am quite confused about what the preferred workflow for this is, specifically where to instantiate the named actor and how to control the resources allocated to it.

Since the actor is only supposed to be created once, I am currently creating it in a “before_init” mixin for the trainer. However, when I allocate a GPU fraction to the actor via

import ray

@ray.remote(num_gpus=0.1)
class SomeActor:
    ...

Ray does not seem to be able to schedule the actor, even though there should in principle be resources left. My trainer config is:

trainer_config = {
    "num_gpus": 0.5,
    "num_workers": 2,
    "num_gpus_per_worker": 0.1,
}

When run, this gives the following warning:

WARNING worker.py:1228 -- The actor or task with ID ... cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.

Required resources for this actor or task: 
{CPU_group_353d0795ff9507ae1f74e42ff0e34b2a: 1.000000}, 
{GPU_group_353d0795ff9507ae1f74e42ff0e34b2a: 0.100000}
Available resources on this node: {
3.000000/6.000000 CPU, 763767575.000000 GiB/763767575.000000 GiB memory, 
0.300000/1.000000 GPU, 244140625.000000 GiB/244140625.000000 GiB object_store_memory, 
0.000000/3.000000 CPU_group_353d0795ff9507ae1f74e42ff0e34b2a, 1.000000/1.000000 accelerator_type:G, 
0.000000/0.700000 GPU_group_353d0795ff9507ae1f74e42ff0e34b2a, 
0.000000/0.500000 GPU_group_0_353d0795ff9507ae1f74e42ff0e34b2a, 
0.100000/0.100000 GPU_group_1_353d0795ff9507ae1f74e42ff0e34b2a, 
1000.000000/1000.000000 bundle_group_1_353d0795ff9507ae1f74e42ff0e34b2a, 
0.100000/0.100000 GPU_group_2_353d0795ff9507ae1f74e42ff0e34b2a, 
1.000000/1.000000 CPU_group_2_353d0795ff9507ae1f74e42ff0e34b2a, 
1000.000000/1000.000000 bundle_group_2_353d0795ff9507ae1f74e42ff0e34b2a, 
1.000000/1.000000 node:192.168.2.6, 
0.000000/1.000000 CPU_group_0_353d0795ff9507ae1f74e42ff0e34b2a, 
1.000000/1.000000 CPU_group_1_353d0795ff9507ae1f74e42ff0e34b2a, 
1000.000000/1000.000000 bundle_group_0_353d0795ff9507ae1f74e42ff0e34b2a, 
3000.000000/3000.000000 bundle_group_353d0795ff9507ae1f74e42ff0e34b2a}

From what I understand, the trainer's placement group reserves 0.5 + 2 * 0.1 = 0.7 fractional GPUs, which are then fully claimed by the driver process and the two workers. The actor's request targets the GPU_group_... resources of that placement group, which are already used up, even though 0.3 GPUs are still free on the node itself.
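A minimal way to double-check this accounting from the driver (assuming a Ray session has already been initialized) is:

import ray

# Compare the total cluster resources with what is still unclaimed.
# The *_group_* entries correspond to the trainer's placement group bundles.
print(ray.cluster_resources())
print(ray.available_resources())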

In summary, I have two questions:

  1. Is there a better workflow than creating named actors in a trainer “before_init” mixin? (The mixin approach also allows easy access to environment-specific parameters, e.g. the observation space.)
  2. What is the preferred way to control the resources allocated to a named actor that is created outside of the worker creation? (A rough sketch of the kind of workflow I have in mind is shown below.)
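For reference, the alternative I have in mind for question 2 is roughly the following: create the actor from the driver script before building the trainer, with explicit resources, and look it up by name inside the workers. This is only a sketch; the actor name, the 0.1 GPU fraction, and the GlobalMatrixActor class from the snippet above are placeholders:

import ray

ray.init()

# Create the named actor outside the trainer and its placement group,
# so that its GPU fraction is taken from the general cluster resources.
matrix_actor = GlobalMatrixActor.options(
    name="global_matrix",
    lifetime="detached",   # keep the actor alive independently of this driver
    num_gpus=0.1,
).remote()

# Inside a worker / environment, the actor can then be looked up by name:
# matrix_actor = ray.get_actor("global_matrix")
# ray.get(matrix_actor.update.remote(indices, values))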

I hope the question is clear, and thanks in advance,

anyboby