Placement group mechanism - why actor can't take free slot for same group?

Akkarine · August 11, 2021, 9:25pm

Or what does group_2_<with_same_id> mean?

asm582 · August 11, 2021, 10:06pm

Hello,

Looking at the logs, I see that group_2_ is a different group. If this helps, please check the format of the CPU_group string. I think the groups you mentioned do not have any resources, hence the actor is not scheduled.
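One way to check the format of such a string is to split it into its parts. The sketch below assumes the names look like `<resource>_group_<bundle index>_<group id>`, with the index part left out for a group-wide entry. That layout is an assumption made for illustration and is not confirmed in this thread; the helper names and the id `a1b2c3d4` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class GroupResource(NamedTuple):
    resource: str                # e.g. "CPU"
    bundle_index: Optional[int]  # None when the name carries no index part
    group_id: str                # hex id, assumed to be shared by one group


# Assumed layout: <resource>_group_[<index>_]<hex group id>
_NAME = re.compile(
    r"^(?P<resource>.+?)_group_(?:(?P<index>\d+)_)?(?P<group_id>[0-9a-f]+)$"
)


def parse_group_resource(name: str) -> Optional[GroupResource]:
    """Split a group-formatted resource name, or return None if it does not match."""
    match = _NAME.match(name)
    if match is None:
        return None
    index = match.group("index")
    return GroupResource(
        resource=match.group("resource"),
        bundle_index=None if index is None else int(index),
        group_id=match.group("group_id"),
    )


def same_group(name_a: str, name_b: str) -> bool:
    """True when both names parse and carry the same group id."""
    a, b = parse_group_resource(name_a), parse_group_resource(name_b)
    return a is not None and b is not None and a.group_id == b.group_id


# Hypothetical names, not taken from the logs in this thread.
indexed = parse_group_resource("CPU_group_2_a1b2c3d4")
group_wide = parse_group_resource("CPU_group_a1b2c3d4")
```

Under that assumption, the `2` in `group_2_<id>` would be read as an index inside the group rather than as part of the group id, so two names with the same trailing id would compare as the same group.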

Akkarine · August 12, 2021, 4:59am

Thanks for your attention. I have only two types of groups, so there can’t be any other groups. And the group with the _2 count has the same group id as the unscheduled actor. I don’t understand why this group does not fit it.

I’m creating the groups with the piece of code below.
The Study actor demands 0.5 CPU (@ray.remote(num_cpus=0.5)) and acts as a controller process. It runs and watches another heavy compute actor with a 1 CPU demand, placing it in the load_pg group.
The aim is to spread the ‘controllers’ across nodes, because they do not take much CPU time when there is load, so some nodes just wait (hence the ‘strict spread’ policy; BTW, plain spread was not working for that), and to use all the remaining CPU power for the actors that actually compute (spread policy).

...
        # Count nodes via the per-node 'node:<ip>' entries in the cluster resources.
        node_count = len([entry for entry in ray.cluster_resources() if entry.startswith('node:')])
        study_controller_cpu_cost = 0.5
        studies_per_node = 3
        min_cpu_num_on_node = 4
        study_controller_total_cpu_cost_on_node = int(round(studies_per_node * study_controller_cpu_cost))
        assert study_controller_total_cpu_cost_on_node <= min_cpu_num_on_node, 'study bundle not fitting on node, decrease studies_per_node or study_controller_cpu_cost'
        studies_in_parallel = node_count * studies_per_node

        try:
            load_pg = ray.util.get_placement_group("load_pg")
            study_pg = ray.util.get_placement_group("study_pg")
        except ValueError:
            study_bundles = [{"CPU": study_controller_total_cpu_cost_on_node} for _ in range(node_count)]
            study_pg = placement_group(study_bundles, strategy="STRICT_SPREAD", name='study_pg')
            load_bundles = [{"CPU": 1} for _ in range(studies_in_parallel)]
            load_pg = placement_group(load_bundles, strategy="SPREAD", name='load_pg')
        ray.get(study_pg.ready())
        ray.get(load_pg.ready())
        for proc_id in range(studies_in_parallel):
            actors.append(
                Study.options(
                    name=f'Study proc#{proc_id}',
                    placement_group=study_pg
                ).remote(
                    placement_group_ref=load_pg,
                )
            )
...
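As a side check on the bundle sizing in the snippet above: `int(round(...))` in Python 3 rounds halves to the nearest even integer, so the per-node bundle can come out smaller than the total the controllers ask for. With the values posted here (3 studies at 0.5 CPU) it rounds 1.5 up to 2, which is enough, so this is probably not what blocks the actor in this thread. The sketch below is plain arithmetic with hypothetical helper names; it does not call Ray.

```python
import math
from fractions import Fraction


def bundle_cpu_rounded(studies_per_node: int, cpu_cost: float) -> int:
    """Bundle size as computed in the snippet above: int(round(n * cost))."""
    return int(round(studies_per_node * cpu_cost))


def bundle_cpu_ceil(studies_per_node: int, cpu_cost: float) -> int:
    """Alternative sizing that never rounds below the requested total."""
    return math.ceil(studies_per_node * cpu_cost)


def actors_that_fit(bundle_cpu: int, cpu_cost: float) -> int:
    """How many actors of cpu_cost fit in one bundle (exact, via Fraction)."""
    return Fraction(bundle_cpu) // Fraction(str(cpu_cost))


# Values from the thread: 3 controllers per node at 0.5 CPU each.
posted = bundle_cpu_rounded(3, 0.5)

# A different, hypothetical setting: 5 controllers per node at 0.5 CPU each.
rounded_five = bundle_cpu_rounded(5, 0.5)
ceil_five = bundle_cpu_ceil(5, 0.5)
```

With 5 controllers per node the rounded bundle holds only 4 of them, while the `math.ceil` variant leaves room for all 5, so `math.ceil` may be the safer choice if `studies_per_node` is ever changed.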

asm582 · August 12, 2021, 12:56pm

Thank you for sharing more information. I think your setup has two placement strategies, and one group does not have resources. Have you tried increasing the resources in the starved group to see if your actor gets scheduled?
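To see which group entries are starved, one option is to compare the total and the currently free resources. The sketch below works on two plain dicts shaped like the ones returned by `ray.cluster_resources()` and `ray.available_resources()`. The helper name and the sample values are hypothetical, and treating a missing key as "nothing free" is an assumption.

```python
from typing import Dict, List


def starved_group_resources(
    total: Dict[str, float], available: Dict[str, float]
) -> List[str]:
    """Return group-formatted resource names that have nothing left to hand out."""
    return sorted(
        name
        for name in total
        if "_group_" in name and available.get(name, 0.0) <= 0.0
    )


# Hypothetical snapshot, not taken from the logs in this thread.
total = {"CPU": 8.0, "CPU_group_0_a1b2c3d4": 2.0, "CPU_group_1_a1b2c3d4": 2.0}
available = {"CPU": 4.0, "CPU_group_0_a1b2c3d4": 0.5}

starved = starved_group_resources(total, available)
```

In a live session the two dicts could be taken from `ray.cluster_resources()` and `ray.available_resources()` right after the actor is reported as pending.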
