I want to create a placement group, pass it to a remote task to use, and remove it when the task is done.
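import ray
from ray.util.placement_group import placement_group, remove_placement_group

@ray.remote
def large_task(pg):
    ...
    # Remove the placement group from inside the task that consumed it.
    remove_placement_group(pg)

for i in range(100):
    pg = placement_group(...)
    large_task.remote(pg)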
There will only be enough resources to create, for example, 10 placement groups. With this code, I don’t need to worry about cleaning up the PGs on the driver. However, calling remove_placement_group in the remote function seems to cause a segfault. The code seems to use global_worker, etc. What’s the right way to clean up the placement groups?
Thanks for confirming that! If I were to manage the placement groups in the driver code, I would need a placement group pool and would have to actively control the number of jobs running (otherwise PG creation would block execution). With the above code, the scheduling behaviour is controlled by the PG logic, and I cannot create more PGs than the resource constraints allow. This was much simpler. I would probably need to handle the exception from the task getting killed, then.
It still does not change the fact that, for me, discarding the PG at its consumer makes it easier to use. The driver creates a PG, launches a job with the PG, and that’s it.
Do you usually have a single task per placement group? If you have multiple tasks, removing the PG from a consumer can easily cause a race condition, right? (Given the fate-sharing behavior.)
One solution (probably not ideal, but it might work) is to create a detached actor dedicated to removing placement groups. You can send a request to the detached actor whenever you need to GC PGs.
I ended up using a launcher remote function that creates the PG, waits for it to be ready, starts a new large job using the PG, and removes the PG. Yes, I wanted to use a dedicated machine for each large job. Thanks for checking!