Or what does group_2_<with_same_id> mean?
Hello,
Looking at the logs, I see that group_2_ is a different group. If it helps, please check the format of the CPU_group string. I think the groups you mentioned indeed do not have any resources, hence the actor is not scheduled.
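For reference, as far as I know Ray names the resources it reserves for a placement group `<resource>_group_<bundle_index>_<group_id>`, plus a wildcard `<resource>_group_<group_id>` that covers all bundles, so the `2` would be a bundle index rather than a count. A minimal sketch for splitting such names, assuming that format; `parse_group_resource` and the sample ids are made up for illustration and are not part of Ray:

```python
import re
from typing import NamedTuple, Optional


class GroupResource(NamedTuple):
    resource: str                 # e.g. "CPU"
    bundle_index: Optional[int]   # None for the wildcard (all-bundles) form
    group_id: str                 # hex id of the placement group


# Assumed format: <resource>_group_[<bundle_index>_]<hex group id>
_PATTERN = re.compile(
    r"^(?P<resource>.+?)_group_(?:(?P<index>\d+)_)?(?P<group_id>[0-9a-f]+)$"
)


def parse_group_resource(name: str) -> Optional[GroupResource]:
    """Split a placement-group resource name; return None for plain resources."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    index = match.group("index")
    return GroupResource(
        resource=match.group("resource"),
        bundle_index=int(index) if index is not None else None,
        group_id=match.group("group_id"),
    )


# Made-up ids, only to show the two forms.
print(parse_group_resource("CPU_group_2_abc123"))
print(parse_group_resource("CPU_group_abc123"))
```

If two names parse to the same `group_id`, they belong to the same placement group, and the one with a `bundle_index` refers to a single bundle of it.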
Thanks for your attention. I have only two types of groups, so there can’t be any other groups. And the group with the _2
count has the same group id as the actor that is not scheduled. I don’t understand why this group does not fit it.
I’m creating the groups with the piece of code below.
The Study actor demands 0.5 CPU (@ray.remote(num_cpus=0.5))
and acts as a controller process. It runs and watches another heavy compute actor with a 1 CPU demand, placing it in the load_pg
group.
The aim is to spread the ‘controllers’ across nodes, because they do not take much CPU time when there is load, so some nodes just wait (hence the ‘strict spread’ policy; BTW, just ‘spread’
was not working for that), and to use all the remaining CPU power for the actors that actually compute (spread policy).
...
node_count = len([entry for entry in ray.cluster_resources().keys() if entry[:3]=='nod'])
study_controller_cpu_cost = 0.5
studies_per_node = 3
min_cpu_num_on_node = 4
study_controller_total_cpu_cost_on_node = int(round(studies_per_node * study_controller_cpu_cost))
assert study_controller_total_cpu_cost_on_node <= min_cpu_num_on_node, 'study bundle not fitting on node, decrease studies_per_node or study_controller_cpu_cost'
studies_in_parallel = node_count * studies_per_node
try:
    load_pg = ray.util.get_placement_group("load_pg")
    study_pg = ray.util.get_placement_group("study_pg")
except ValueError:
    study_bundles = [{"CPU": study_controller_total_cpu_cost_on_node} for _ in range(node_count)]
    study_pg = placement_group(study_bundles, strategy="STRICT_SPREAD", name='study_pg')
    load_bundles = [{"CPU": 1} for _ in range(studies_in_parallel)]
    load_pg = placement_group(load_bundles, strategy="SPREAD", name='load_pg')
ray.get(study_pg.ready())
ray.get(load_pg.ready())
for proc_id in range(studies_in_parallel):
    actors.append(
        Study.options(
            name=f'Study proc#{proc_id}',
            placement_group=study_pg
        ).remote(
            placement_group_ref=load_pg,
        )
    )
...
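One thing that may be worth checking in the sizing above: Python's `round` rounds halves to the nearest even integer, so `int(round(1.5))` is 2 but `int(round(2.5))` is also 2, which would leave a bundle too small for five 0.5-CPU controllers. Below is a small sketch of a sizing that always rounds up, plus a round-robin choice of bundle per controller. The helper names `bundle_cpus` and `bundle_index_for` are hypothetical, and passing the index through `placement_group_bundle_index` is an assumption about the Ray version in use.

```python
import math
from collections import Counter


def bundle_cpus(studies_per_node: int, cpu_per_study: float) -> int:
    """CPUs to reserve per node bundle, rounded up so every controller fits."""
    return math.ceil(studies_per_node * cpu_per_study)


def bundle_index_for(proc_id: int, node_count: int) -> int:
    """Round-robin: controller proc_id goes to bundle proc_id mod node_count."""
    return proc_id % node_count


# Values from the post: 3 controllers per node at 0.5 CPU each.
print(bundle_cpus(3, 0.5))   # same size as the int(round(...)) version here
print(int(round(5 * 0.5)))   # half rounds to even, smaller than the 2.5 CPU demand
print(bundle_cpus(5, 0.5))   # rounds up instead

# With 2 nodes and 3 controllers per node, each bundle gets exactly 3 controllers.
node_count, studies_per_node = 2, 3
load = Counter(
    bundle_index_for(proc_id, node_count)
    for proc_id in range(node_count * studies_per_node)
)
print(dict(load))

# Assumed usage with the actors from the post (not run here):
# Study.options(
#     placement_group=study_pg,
#     placement_group_bundle_index=bundle_index_for(proc_id, node_count),
# ).remote(placement_group_ref=load_pg)
```

Pinning each controller to a bundle index makes it explicit which bundle an unscheduled actor is waiting on, which should line up with the index in the `CPU_group_<index>_<id>` name from the logs.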
Thank you for sharing more information. Your setup seems to have two placement strategies, and one group does not have resources. Have you tried increasing the resources in the starved group to see if your actor gets scheduled?
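To see which group is starved before changing anything, one option is to compare the total and currently available placement-group resources. This is only a sketch: `starved_group_resources` is a hypothetical helper, the sample dictionaries and ids are made up, and in a live cluster the two inputs would presumably come from `ray.cluster_resources()` and `ray.available_resources()`.

```python
from typing import Dict, Mapping


def starved_group_resources(
    total: Mapping[str, float],
    available: Mapping[str, float],
    demand: float,
) -> Dict[str, float]:
    """Return placement-group resources whose free amount is below `demand`.

    A resource missing from `available` is treated as having nothing free.
    """
    return {
        name: available.get(name, 0.0)
        for name in total
        if "_group_" in name and available.get(name, 0.0) < demand
    }


# Made-up snapshot: bundle 2 of group "abc123" has no CPU left.
total = {"CPU": 8.0, "CPU_group_0_abc123": 2.0, "CPU_group_2_abc123": 2.0}
available = {"CPU": 1.0, "CPU_group_0_abc123": 0.5}

print(starved_group_resources(total, available, demand=0.5))

# Assumed usage against a running cluster (not run here):
# starved_group_resources(ray.cluster_resources(), ray.available_resources(), 0.5)
```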