Ray IP-based scheduling using placement group hangs

I need to schedule specific actors/tasks to specific machines. By using a placement group with the node's ID as a resource, I was able to achieve this. However, when I run pytest with multiple test programs, it hangs, maybe due to resource contention?

This is how I create the placement group, maybe in an unusual or even wrong way:
from ray.util.placement_group import placement_group

def create_pg(node_id):
    try:
        bundles = []
        res_cap = {node_id: 1}  # use node_id as a resource with capacity of 1
        cpu_cap = {"CPU": 1}
        bundles.append(res_cap)
        bundles.append(cpu_cap)
        return placement_group(
            bundles, strategy="STRICT_PACK"  # strategy is passed as a string
        )
    except Exception as e:
        print(e)
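
For reference, here is a minimal sketch of how such a placement group can be used to pin an actor to a chosen node. The node IP and the Worker actor are hypothetical (each node advertises a "node:<ip>" resource with capacity 1), and the placement_group option in .options() may differ across Ray versions:

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Hypothetical node IP; list the real ones via ray.nodes().
node_id = "node:192.168.1.10"

pg = create_pg(node_id)
ray.get(pg.ready())  # block until both bundles are reserved

@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "scheduled on the chosen node"

# Placing the actor in the group pins it to the node holding the bundles.
w = Worker.options(placement_group=pg).remote()
print(ray.get(w.ping.remote()))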

Then I used pytest to run 2 test files, both creating the placement group the same way as above, and ran into an issue where the second test function hangs. ray status shows a list of pending tasks/actors whose meaning I don't understand.

Resources

Usage:
0.0/32.0 CPU
0.00/69.754 GiB memory
0.00/33.886 GiB object_store_memory

Demands:
{'CPU_group_ed4e518ee308c183c1704fa219f35dfe': 1.0}: 1+ pending tasks/actors
{'CPU_group_f1aaa2d6530300eb22f3b91958bfc10e': 1.0}: 1+ pending tasks/actors
{'CPU_group_8eb1936673b9e77c491b9145b39d2653': 1.0}: 1+ pending tasks/actors
{'CPU_group_70faa0a852b14c608765e750e07fe568': 1.0}: 1+ pending tasks/actors
{'CPU_group_cac9fe0ad2a29d55c25d7d01ea0f0a5f': 1.0}: 1+ pending tasks/actors
{'CPU_group_eb09d2b20272f2aaa2e6265da0753f6d': 1.0}: 1+ pending tasks/actors

Can I get some advice on how to solve this issue, and is my way of creating the placement group correct or not? Thanks very much.


Yeah, it is because you require 1 unit of the node-ID resource. Each node only has 1 unit of its node-ID resource, so the first placement group reserves all of it and the bundles of every later placement group can never be fulfilled. That's why the CPU_group_<pg-id> demands in ray status stay pending: they are the per-placement-group wrapped resources that your tasks/actors are waiting on. You should request less of the node-ID resource; indeed, it is fine to request only a small amount (and that's more robust). For example, {node_id: 0.01} should suffice for your use case.
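
Concretely, a minimal sketch of the fractional request (node_id here is the same node-ID resource string as above):

bundles = [{node_id: 0.01}, {"CPU": 1}]  # fractional request leaves room for other PGs
pg = placement_group(bundles, strategy="STRICT_PACK")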

Thanks sangcho! Yeah, that solves the problem. It is good to know the node-ID resource also supports fractional requests.
But what if we end up creating more than 100 placement groups, each requesting 0.01 of such a resource? Is there a mechanism to dynamically determine or update this manually set number?
Thanks.

I think in that case you'd just use a smaller value. AFAIK, we might not support values less than 0.001, but it is unlikely you need that many placement groups on the same node without removing any of them.

One other possibility is to use your own custom resources (which you can define when you call ray start).
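
For example, a sketch assuming a hypothetical custom resource named machine_a (ray start accepts a --resources flag with a JSON map of resource names to capacities):

# On the target worker node (shell):
#   ray start --address=<head-address> --resources='{"machine_a": 100}'
#
# Python equivalent when starting a local cluster:
import ray
ray.init(resources={"machine_a": 100})

from ray.util.placement_group import placement_group

# Request one unit of the custom resource to pin the bundle to that machine.
pg = placement_group([{"machine_a": 1}, {"CPU": 1}], strategy="STRICT_PACK")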

Thank you sangcho! Do we need to explicitly/manually remove every placement group that was created? Or are they automatically reclaimed when the program exits?

PGs are removed when the job that created them exits, unless you set their lifetime to detached!
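
A minimal sketch of both behaviors (bundle shapes are just examples):

from ray.util.placement_group import placement_group, remove_placement_group

# Default lifetime: reclaimed automatically when the creating job exits.
pg = placement_group([{"CPU": 1}])

# Detached lifetime: outlives the job, so it must be removed explicitly.
detached_pg = placement_group([{"CPU": 1}], lifetime="detached")
remove_placement_group(detached_pg)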

Got it, sangcho! Appreciate your help.