Ray IP-based scheduling using placement group hangs

I need to schedule specific actors/tasks to specific machines. By using a placement group with the node's ID as a resource, I was able to achieve this. However, when I run pytest with multiple test programs, it hangs, maybe due to resource contention?

This is how I create the placement group, maybe in an unusual or even wrong way:
from ray.util.placement_group import placement_group

def create_pg(node_id):
    try:
        bundles = []
        res_cap = {node_id: 1}  # use node_id as a resource with capacity of 1
        cpu_cap = {"CPU": 1}
        bundles.append(res_cap)
        bundles.append(cpu_cap)
        return placement_group(
            bundles, strategy="STRICT_PACK"  # strategy is passed as a string
        )
    except Exception as e:
        print(e)
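
For reference, here is a minimal sketch of how such a placement group can be used to pin an actor to a chosen node. The node IP and the Worker actor are hypothetical (each node advertises a "node:<ip>" resource with capacity 1), and the placement_group option in .options() may differ across Ray versions:

import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# Hypothetical node IP; list the real ones via ray.nodes().
node_id = "node:192.168.1.10"

pg = create_pg(node_id)
ray.get(pg.ready())  # block until both bundles are reserved

@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "scheduled on the chosen node"

# Placing the actor in the group pins it to the node holding the bundles.
w = Worker.options(placement_group=pg).remote()
print(ray.get(w.ping.remote()))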

Then I used pytest to run 2 test files, both creating the placement group the same way as above, and ran into an issue where the second test function hangs. ray status shows a list of pending tasks/actors whose meaning I don't understand.

Resources

Usage:
0.0/32.0 CPU
0.00/69.754 GiB memory
0.00/33.886 GiB object_store_memory

Demands:
{'CPU_group_ed4e518ee308c183c1704fa219f35dfe': 1.0}: 1+ pending tasks/actors
{'CPU_group_f1aaa2d6530300eb22f3b91958bfc10e': 1.0}: 1+ pending tasks/actors
{'CPU_group_8eb1936673b9e77c491b9145b39d2653': 1.0}: 1+ pending tasks/actors
{'CPU_group_70faa0a852b14c608765e750e07fe568': 1.0}: 1+ pending tasks/actors
{'CPU_group_cac9fe0ad2a29d55c25d7d01ea0f0a5f': 1.0}: 1+ pending tasks/actors
{'CPU_group_eb09d2b20272f2aaa2e6265da0753f6d': 1.0}: 1+ pending tasks/actors

Can I get some advice on how to solve this issue, and is my way of creating the placement group correct or not? Thanks very much.


Yeah, it is because you require 1 unit of the node-ID resource. Each node only has 1 unit of its node-ID resource, so the first placement group reserves all of it and the bundles of every later placement group can never be fulfilled. That's why the CPU_group_<pg-id> demands in ray status stay pending: they are the per-placement-group wrapped resources that your tasks/actors are waiting on. You should request less of the node-ID resource; indeed, it is fine to request only a small amount (and that's more robust). For example, {node_id: 0.01} should suffice for your use case.
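
Concretely, a minimal sketch of the fractional request (node_id here is the same node-ID resource string as above):

bundles = [{node_id: 0.01}, {"CPU": 1}]  # fractional request leaves room for other PGs
pg = placement_group(bundles, strategy="STRICT_PACK")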

Thanks sangcho! Yeah, that solves the problem. It is good to know the node-ID resource also supports fractional requests.
But what if we end up creating more than 100 placement groups, each requesting 0.01 of such a resource? Is there a mechanism to dynamically determine or update this manually set number?
Thanks.

I think in that case you'd just use a smaller value. AFAIK, we might not support values less than 0.001, but it is unlikely you need that many placement groups on the same node without removing any of them.

One other possibility is to use your own custom resources (which you can define when you call ray start).
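
For example, a sketch assuming a hypothetical custom resource named machine_a (ray start accepts a --resources flag with a JSON map of resource names to capacities):

# On the target worker node (shell):
#   ray start --address=<head-address> --resources='{"machine_a": 100}'
#
# Python equivalent when starting a local cluster:
import ray
ray.init(resources={"machine_a": 100})

from ray.util.placement_group import placement_group

# Request one unit of the custom resource to pin the bundle to that machine.
pg = placement_group([{"machine_a": 1}, {"CPU": 1}], strategy="STRICT_PACK")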

Thank you sangcho! Do we need to explicitly/manually remove every placement group that was created? Or are they automatically reclaimed when the program exits?

PGs are removed when the job that created them exits, unless you set their lifetime to detached!
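
A minimal sketch of both behaviors (bundle shapes are just examples):

from ray.util.placement_group import placement_group, remove_placement_group

# Default lifetime: reclaimed automatically when the creating job exits.
pg = placement_group([{"CPU": 1}])

# Detached lifetime: outlives the job, so it must be removed explicitly.
detached_pg = placement_group([{"CPU": 1}], lifetime="detached")
remove_placement_group(detached_pg)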

Got it, sangcho! Appreciate your help.