How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
My objective is to run some profiling-related tasks that each use a fixed number of GPUs. My cluster contains 2 types of GPUs, so I would like to split the cluster into 2 groups and assign tasks to those 2 groups of nodes with some level of locality awareness. This would allow me to parallelize the profiling across actors within the same group, while knowing exactly which of the 2 groups each result is coming from.
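Roughly, the grouping I have in mind looks like the sketch below, assuming each node is tagged with a custom resource for its GPU type when it joins the cluster (e.g. `ray start --resources='{"gpu_type_a": 2}'`). The resource names `gpu_type_a` / `gpu_type_b` and the actor classes are placeholders, not my actual code:

```python
import ray

ray.init(address="auto")

# One actor class per GPU group; the custom resource pins each actor
# to a node of the corresponding GPU type (placeholder resource names).
@ray.remote(num_gpus=1, resources={"gpu_type_a": 1})
class ProfilerA:
    def run(self, workload):
        # placeholder for the real profiling work
        return ("gpu_type_a", workload)

@ray.remote(num_gpus=1, resources={"gpu_type_b": 1})
class ProfilerB:
    def run(self, workload):
        # placeholder for the real profiling work
        return ("gpu_type_b", workload)
```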
Note that, because each task uses a fixed number of GPUs, it may not occupy a whole node, so multiple workers can be on the same node, using the same type of GPUs.
It seems to me that ActorPool is the right tool for this, but when I specify custom resources for the actors inside the pool (so that I am sure they are co-located on the same resource type), the last few tasks submitted to the pool block and run indefinitely in get_next_unordered if 2 or more actors are created on the same node. The number of blocked tasks depends on the number of actors in the pool (1 blocked task if the pool contains 2 actors on the same node). Using a placement group doesn't seem to solve this problem either.
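A minimal sketch of what triggers the hang for me (simplified; the resource name and the profiling body are placeholders for my actual setup):

```python
import ray
from ray.util import ActorPool

ray.init(address="auto")

@ray.remote(num_gpus=1, resources={"gpu_type_a": 1})
class Profiler:
    def run(self, task_id):
        # placeholder for the real profiling work
        return task_id

# The node tagged with "gpu_type_a" has 2 GPUs of that type available,
# so both actors land on the same node.
pool = ActorPool([Profiler.remote() for _ in range(2)])

for i in range(8):
    pool.submit(lambda actor, value: actor.run.remote(value), i)

results = []
while pool.has_next():
    # The last submitted task(s) never come back; this call blocks indefinitely.
    results.append(pool.get_next_unordered())
```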
My hypothesis is that the actors inside the ActorPool are somehow competing for resources under the hood, but I have no idea whether there is a way to fix this. I have tried switching to ray.util.multiprocessing instead, but it doesn't seem to allow passing customized actors, which may themselves call other actor handles internally.