Placement group mechanism - why can't an actor take a free slot in the same group?

Or what does group_2_<with_same_id> mean?

Hello,

Looking at the logs, I see that group_2_ is a different group. If it helps, please check the format of the CPU_group string. I think the groups you mentioned indeed have no resources available, hence the actor is not scheduled.

Thanks for your attention. I have only two types of groups, so there can’t be any other groups. And the group with _2 in its name has the same group ID as the one the unscheduled actor asks for. I don’t understand why this group doesn’t fit it.
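For reference, here is a minimal sketch of how such resource names can be taken apart. It assumes the naming scheme I understand Ray to use: `<resource>_group_<bundle_index>_<group_id>` for a single bundle and `<resource>_group_<group_id>` for the group as a whole. The helper name `parse_pg_resource` and the sample IDs are made up for illustration. If that assumption holds, the `_2_` part would be a bundle index inside the same group rather than a separate group.

```python
import re

# Assumed naming scheme (not verified against a specific Ray version):
#   "<RESOURCE>_group_<bundle_index>_<group_id>"  -> one bundle of the group
#   "<RESOURCE>_group_<group_id>"                 -> the group as a whole
_PG_KEY = re.compile(
    r"^(?P<resource>.+?)_group_(?:(?P<index>\d+)_)?(?P<group_id>[0-9a-f]+)$"
)


def parse_pg_resource(key):
    """Split a placement-group resource key into (resource, bundle_index, group_id).

    bundle_index is None for the whole-group key.
    Returns None if the key does not look like a placement-group resource.
    """
    match = _PG_KEY.match(key)
    if match is None:
        return None
    index = match.group("index")
    return (
        match.group("resource"),
        None if index is None else int(index),
        match.group("group_id"),
    )


# Hypothetical keys, shaped like the ones in ray.available_resources().
per_bundle = parse_pg_resource("CPU_group_2_abc123")
whole_group = parse_pg_resource("CPU_group_abc123")
```

Comparing the last element of both tuples shows whether two keys belong to the same group, regardless of the bundle index.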

I’m creating the groups with the piece of code below.
The Study actor demands 0.5 CPU (@ray.remote(num_cpus=0.5)) and acts as a controller process. It runs and watches another, compute-heavy actor that demands 1 CPU, placing it in the load_pg group.
The aim is to spread the ‘controllers’ across nodes, because they take little CPU time while the load is running, so otherwise some nodes just wait (hence the ‘strict spread’ policy; BTW, plain spread was not working for that), and to use all the remaining CPU power for the actors that actually compute (spread policy).

...
        # Count the nodes via their per-node resource keys ("node:<ip>").
        node_count = sum(1 for key in ray.cluster_resources() if key.startswith('node:'))
        study_controller_cpu_cost = 0.5
        studies_per_node = 3
        min_cpu_num_on_node = 4
        # CPUs reserved per node for the Study controllers (one bundle per node).
        study_controller_total_cpu_cost_on_node = int(round(studies_per_node * study_controller_cpu_cost))
        assert study_controller_total_cpu_cost_on_node <= min_cpu_num_on_node, 'study bundle not fitting on node, decrease studies_per_node or study_controller_cpu_cost'
        # One load actor (1 CPU each) per Study controller.
        studies_in_parallel = node_count * studies_per_node

        try:
            load_pg = ray.util.get_placement_group("load_pg")
            study_pg = ray.util.get_placement_group("study_pg")
        except ValueError:
            study_bundles = [{"CPU": study_controller_total_cpu_cost_on_node} for _ in range(node_count)]
            study_pg = placement_group(study_bundles, strategy="STRICT_SPREAD", name='study_pg')
            load_bundles = [{"CPU": 1} for _ in range(studies_in_parallel)]
            load_pg = placement_group(load_bundles, strategy="SPREAD", name='load_pg')
        ray.get(study_pg.ready())
        ray.get(load_pg.ready())
        for proc_id in range(studies_in_parallel):
            actors.append(
                Study.options(
                    name=f'Study proc#{proc_id}',
                    placement_group=study_pg
                ).remote(
                    placement_group_ref=load_pg,
                )
            )
...
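A side note on the bundle sizing above, as a small sketch rather than a fix for the scheduling problem: `round()` in Python rounds halves to the nearest even integer, so `int(round(...))` can reserve less CPU than the controllers need for some values. With the numbers in the snippet (3 × 0.5) it happens to give 2, which is enough. Rounding up avoids the edge case. The helper name `study_bundle_cpus` and the tolerance value are assumptions for illustration.

```python
import math


def study_bundle_cpus(studies_per_node, cpu_per_study, eps=1e-9):
    """Whole CPUs to reserve per node so that all controllers fit.

    Rounds up instead of to nearest; eps absorbs float noise such as
    30 * 0.1 == 3.0000000000000004.
    """
    return math.ceil(studies_per_node * cpu_per_study - eps)


# Values from the snippet: 3 controllers at 0.5 CPU each.
from_snippet = study_bundle_cpus(3, 0.5)

# Edge case: 5 controllers at 0.5 CPU need 2.5 CPUs in total.
rounded_nearest = int(round(5 * 0.5))   # half rounds to even
rounded_up = study_bundle_cpus(5, 0.5)
```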

Thank you for sharing more information. Your setup seems to have two placement strategies, and one group does not have resources. Have you tried increasing the resources in the starved group to see if your actor gets scheduled?
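To see which group resources are actually used up, one option is to compare the cluster totals with what is currently free. The sketch below works on plain dictionaries shaped like the results of `ray.cluster_resources()` and `ray.available_resources()`. The helper name `exhausted_group_resources` is made up, and it assumes that placement-group keys contain `_group_` and that a key missing from the available dictionary means nothing is free.

```python
def exhausted_group_resources(total, available):
    """Return placement-group resource keys that have capacity but nothing free."""
    return sorted(
        key
        for key, capacity in total.items()
        if "_group_" in key and capacity > 0 and available.get(key, 0.0) <= 0.0
    )


# Hypothetical snapshot; on a live cluster one would pass
# ray.cluster_resources() and ray.available_resources() instead.
total = {"CPU": 8.0, "CPU_group_0_abc": 2.0, "CPU_group_1_abc": 2.0, "CPU_group_abc": 4.0}
available = {"CPU": 4.0, "CPU_group_0_abc": 0.5, "CPU_group_abc": 0.5}
starved = exhausted_group_resources(total, available)
```

If the key the pending actor is waiting for shows up in that list, the bundle is full and either the bundle size or the actor's CPU demand has to change.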