Removing a placement group from a remote function

I want to create a placement group, pass it to a remote task to use, and remove it when the task is done.

import ray
from ray.util.placement_group import placement_group, remove_placement_group

@ray.remote
def large_task(pg):
    ...
    # Clean up the placement group from inside the task once it's done.
    remove_placement_group(pg)

for i in range(100):
    pg = placement_group(...)
    large_task.remote(pg)

There will only be enough resources to create, say, 10 placement groups. With this code, I don't need to worry about cleaning up the pg in the driver. However, calling remove_placement_group in the remote function seems to cause a seg-fault. The code seems to use global_worker, etc. What's the right way to clean up the placement groups?

Hmm, this is not supposed to cause a seg fault. Can you create a GitHub issue with a reproduction?

(pid=3643, ip=xxx.xxx.xxx.xxx) 2021-02-16 23:09:49,506 ERROR worker.py:390 -- SystemExit was raised from the worker
(pid=3643, ip=xxx.xxx.xxx.xxx) Traceback (most recent call last):
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "python/ray/_raylet.pyx", line 476, in ray._raylet.execute_task
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "python/ray/_raylet.pyx", line 480, in ray._raylet.execute_task
(pid=3643, ip=xxx.xxx.xxx.xxx)     remove_placement_group(pg)
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "/home/centos/.local/lib/python3.7/site-packages/ray/util/placement_group.py", line 211, in remove_placement_group
(pid=3643, ip=xxx.xxx.xxx.xxx)     worker.core_worker.remove_placement_group(placement_group.id)
(pid=3643, ip=xxx.xxx.xxx.xxx)   File "/home/centos/.local/lib/python3.7/site-packages/ray/worker.py", line 387, in sigterm_handler
(pid=3643, ip=xxx.xxx.xxx.xxx)     sys.exit(1)
(pid=3643, ip=xxx.xxx.xxx.xxx) SystemExit: 1

This was the exact trace.

Ah, this is the expected behavior. Tasks and actors scheduled with a placement group fate-share with the PG.

I am not sure if this is mentioned in the docs; I will update them if it isn't. Just to make sure: your code looks like this, right?

@ray.remote
def task(pg):
    # This task is scheduled inside pg, so removing pg kills the task itself.
    remove_placement_group(pg)

pg = placement_group(...)
task.options(placement_group=pg).remote(pg)

One possible solution is to remove the placement groups in the driver that submits the tasks, after all tasks are done.
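
A minimal sketch of that pattern, assuming the large_task and loop from your example (the {"CPU": 1} bundle shape is only a placeholder):

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

@ray.remote
def large_task():
    ...  # work that runs inside the placement group

pgs, refs = [], []
for _ in range(10):
    pg = placement_group([{"CPU": 1}])
    ray.get(pg.ready())  # wait until the PG is actually scheduled
    pgs.append(pg)
    refs.append(large_task.options(placement_group=pg).remote())

# Remove each PG only after every task using it has finished,
# so no running task fate-shares with a removed PG.
ray.get(refs)
for pg in pgs:
    remove_placement_group(pg)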

Thanks for confirming that! If I were to manage the placement groups in the driver code, I would need a placement group pool and would have to actively control the number of jobs running (otherwise PG creation would block the execution). With the above code, the scheduling behaviour is controlled by the PG logic, and I cannot create more PGs than the resource constraints allow. This was much simpler. I would probably need to handle the exception from the task getting killed, then.

Sorry, I was mistaken in thinking that placement_group is a blocking operation. It's an async operation… I was using ray.get(pg.ready()) by default.

It still doesn't change the fact, for me, that discarding the PG at its consumer makes it easier to use. The driver creates a PG, launches a job with the PG, and that's it.

I can deal with this with one more remote function, though I would rather avoid nested remote calls.

Do you usually have a single task per placement group? If you have multiple tasks, removing it from a consumer can easily cause a race condition, right? (given the fate-sharing behavior)

One possible solution (probably not ideal, but it might work) is to create a detached actor dedicated to removing placement groups. You can send a request to the detached actor whenever you need to GC PGs.
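
A rough sketch of that idea; the actor name "pg_cleaner" and its method are made up for illustration:

import ray
from ray.util.placement_group import remove_placement_group

ray.init()

@ray.remote
class PgCleaner:
    # This actor is not scheduled inside any PG, so it never
    # fate-shares with the placement groups it removes.
    def remove(self, pg):
        remove_placement_group(pg)

# Detached so it survives independently of the driver that created it.
cleaner = PgCleaner.options(name="pg_cleaner", lifetime="detached").remote()

# A consumer can then hand off cleanup, e.g. as its last action:
#   ray.get_actor("pg_cleaner").remove.remote(pg)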

I ended up using a launcher remote function that creates the PG, waits for it, starts a new large job using the PG, and removes the PG. Yes, I wanted to use a dedicated machine for each large job. Thanks for checking!
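
Roughly, the launcher looks like this (the bundle size and the STRICT_PACK strategy are illustrative; STRICT_PACK forces all bundles onto one node, which gives each job its own machine):

import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

@ray.remote
def large_job():
    ...  # the actual work, pinned to the PG

@ray.remote
def launcher():
    # The launcher itself is NOT scheduled in the PG, so removing
    # the PG here does not kill the launcher.
    pg = placement_group([{"CPU": 4}], strategy="STRICT_PACK")
    ray.get(pg.ready())  # wait for a dedicated machine
    ray.get(large_job.options(placement_group=pg).remote())
    remove_placement_group(pg)

ray.get([launcher.remote() for _ in range(100)])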
