New placement groups not being created after initial round

  • Medium: It makes my task significantly more difficult to complete, but I can work around it.

I have a workflow that requires several tasks to be run in consecutive order, so I wanted to use placement groups (PGs) to do this.

As recommended by this thread, I created a nested function structure so that the PG could be created and destroyed by the same driver without itself being contained in the PG. The code essentially looks like the following:

import ray
from ray.util.placement_group import placement_group, remove_placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

@ray.remote(resources=my_resources)
def task1(input):
    ...
    return output

@ray.remote(resources=my_resources)
def task2(input):
    ...
    return output

@ray.remote
def manage_tasks(input):
    # Create a placement group for this round and block until it is scheduled.
    pg = placement_group([my_resources], strategy="STRICT_PACK")
    ray.get(pg.ready())
    pg_strategy = PlacementGroupSchedulingStrategy(placement_group=pg)

    # Run the two tasks back to back inside the placement group.
    future1 = task1.options(scheduling_strategy=pg_strategy).remote(input)
    output = ray.get(task2.options(scheduling_strategy=pg_strategy).remote(future1))

    # Tear the group down so its resources can serve the next round.
    remove_placement_group(pg)
    return output

outputs = ray.get([manage_tasks.remote(i) for i in range(100)])

The issue I am having is that after the first “round” of placement groups, the remaining manage_tasks workers never make it past the ray.get(pg.ready()) line. I am monitoring the cluster, and there is room on the worker nodes to create new PGs. The cluster also shows pending demands for one or more placement groups. Thoughts?
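One way to see what the scheduler thinks of the stuck groups is to dump the placement group table from the driver. This is a minimal sketch; placement_group_table() lives in ray.util on 2.x, and the exact keys in the returned dicts may vary slightly by version:

import ray
from ray.util import placement_group_table

# Print every placement group the cluster knows about. Groups that never
# get past pg.ready() should show up here as PENDING rather than CREATED.
for pg_id, info in placement_group_table().items():
    print(pg_id, info.get("state"), info.get("bundles"))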

What’s the Ray version you are using? What’s the cluster setup: how many nodes, how many resources on each node? What’s my_resources? cc @sangcho

version:

  • ray==2.0.0rc1

cluster setup:

  • 1 worker node
  • resources: {"CPU": 12, "GPU": 1, "guppy-mm2": 1}

task1 & task2 remote decorator resources: num_gpus=1, resources={'guppy-mm2': 0.01}

my_resources in manage_tasks function body: {'guppy-mm2': 0.01, 'CPU': 1, 'GPU': 1}
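Putting those replies together, the setup looks roughly like the following (a sketch reconstructed from the descriptions above, not the poster's exact code):

@ray.remote(num_gpus=1, resources={"guppy-mm2": 0.01})
def task1(input):
    ...

@ray.remote(num_gpus=1, resources={"guppy-mm2": 0.01})
def task2(input):
    ...

# Bundle requested by each placement group inside manage_tasks:
my_resources = {"guppy-mm2": 0.01, "CPU": 1, "GPU": 1}

With a single worker node exposing one GPU and each bundle asking for GPU: 1, only one placement group can exist at a time. The expectation is that each remove_placement_group frees the GPU and lets the next pending group be created; that hand-off is what appears to be failing here.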

I’m able to reproduce it and I created an issue to track it: https://github.com/ray-project/ray/issues/28160. Thanks for reporting it!
