What’s your Ray version? (just to see if we can use the new debugging tool)
My suspicion is that the pending actor is waiting for a placement group that hasn’t been created yet. I am not sure why you have so many pending placement groups, but you have 11 of them, and if the actor requires one of those pending placement groups, the actor cannot be scheduled.
We requested 1 ray head, 6 ray workers (each with 10 cpu and 1 gpu).
Only 1 ray head and 4 ray workers are launched.
The other 2 ray workers are dead. → But this should not affect scheduling since we still have enough resources in the ray cluster.
cat /tmp/ray/session_latest/logs/raylet.out
...
[2022-10-28 06:19:43,346 W 23 23] (raylet) agent_manager.cc:94: Agent process with pid 60 has not registered. ip , pid 0
[2022-10-28 06:19:43,351 W 23 61] (raylet) agent_manager.cc:104: Agent process with pid 60 exit, return value 0. ip . pid 0
[2022-10-28 06:19:43,351 E 23 61] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
We still have our pod running. Is there any command we can run to figure out why scheduling is blocked? Is there a place to see the scheduling logs?
Each trial will use {'CPU': 10.0, 'GPU': 1.0} * 2 (PACK); the 2 means each trial has two Horovod workers.
So what happens is that for the first trial, only one Horovod worker {'CPU': 10.0, 'GPU': 1.0} is scheduled; the other Horovod worker {'CPU': 10.0, 'GPU': 1.0} is not.
For PACK, all provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes… So this should not be the issue.
What puzzles me is how a partial placement group can be granted (i.e., for only one Horovod worker) when Tune always requests the whole trial (equivalent to two workers) at a time.
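To illustrate the PACK semantics quoted above, here is a toy sketch (not Ray’s actual scheduler code) of best-effort packing with spilling: every bundle must fit *somewhere*, or the whole placement group stays pending — a partial grant is never returned. The `pack` function and its node/bundle inputs are hypothetical, for illustration only.

```python
# Toy model of PACK's best-effort placement (not Ray's real scheduler).
# bundles: list of per-bundle CPU demands; nodes: list of free CPUs per node.
def pack(bundles, nodes):
    """Return a node index per bundle, or None if any bundle fits nowhere."""
    free = list(nodes)
    placement = []
    for cpus in bundles:
        # Prefer earlier nodes so bundles co-locate when possible.
        for i, capacity in enumerate(free):
            if capacity >= cpus:
                free[i] -= cpus
                placement.append(i)
                break
        else:
            # One bundle fits on no node: the whole group stays pending.
            return None
    return placement

# Two 10-CPU bundles; one node has 10 free, another has 48 free:
print(pack([10, 10], [10, 48]))  # [0, 1] — spilled to a second node, but granted

# Two 10-CPU bundles, but only 8 CPUs free on each node:
print(pack([10, 10], [8, 8]))    # None — the placement group stays pending
```

The point of the sketch: spilling across nodes is allowed under PACK, but an all-or-nothing grant is still required, which is consistent with Tune requesting the whole trial at once.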
When the actor is scheduled, it requires both resources and a placement group. So it is highly likely that the placement group the actor requires is not ready (you have 11 pending PGs, so one of them is probably required by the pending actor). If you look at your CPU usage, 40 of 48 CPUs are reserved, and that’s why you cannot schedule an additional placement group anymore.
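A quick back-of-the-envelope check of the numbers above (48 CPUs total, 40 reserved, two {'CPU': 10.0, 'GPU': 1.0} bundles per trial — all taken from this thread) shows why the next placement group cannot be granted:

```python
# Cluster numbers reported in this thread.
total_cpus = 48
reserved_cpus = 40
free_cpus = total_cpus - reserved_cpus          # 8 CPUs left

# Each trial requests a placement group of two such bundles (PACK).
bundle = {"CPU": 10.0, "GPU": 1.0}
bundles_per_trial = 2
needed_cpus = bundle["CPU"] * bundles_per_trial  # 20.0 CPUs

# Not even one bundle fits in the remaining CPUs, let alone two,
# so the next trial's placement group (and its actor) stays pending.
print(free_cpus >= bundle["CPU"])   # False
print(free_cpus >= needed_cpus)     # False
```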
I am not exactly sure how Horovod creates placement groups and actors. @xwjiang2010, can you tell me a bit more about how this works here? Does each Horovod worker use one placement group? How do you specify the resource usage?