Ray actors cannot be scheduled due to resource constraints

We are running Ray Tune where each trial runs with 2 Horovod workers.

On the Ray dashboard, there is an error message: BaseHorovodWorker cannot be created because the Ray cluster cannot satisfy its resource requirements.

When I run ray status, this is the output.

======== Autoscaler status: 2022-10-28 21:33:23.829031 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_a4ac8c7f866699e1bc8288994fd1d61b876b4be93c67f4d9a02c4f1b
 1 node_2d8e044893addc45eba880ba0dfb495bda47b767506fd3b59d405976
 1 node_03ce0fcc29a268b0152701ef3e9c95c1c382dfb4a02e59bbd1de0b53
 1 node_9563ded98c6c1caf78a7cd20082682c013cea2bea73e38a4d5d39562
 1 node_5d22539387a4b1b77915de793251effc732ff686ca3a2dc704a1690a
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 10.0/48.0 CPU (10.0 used of 40.0 reserved in placement groups)
 1.0/4.0 GPU (1.0 used of 4.0 reserved in placement groups)
 0.0/4.0 accelerator_type:V100
 0.00/136.000 GiB memory
 0.00/37.717 GiB object_store_memory

Demands:
 {'CPU': 10.0, 'GPU': 1.0}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 10.0, 'GPU': 1.0} * 2 (PACK): 11+ pending placement groups

From this output, the cluster has enough resources for {'CPU': 10.0, 'GPU': 1.0}; however, the actor is still pending and cannot be scheduled.

We are using horovod.DistributedTrainableCreator to construct the trainable.

Thanks.

@sangcho @Alex do you remember how we can get Ray node resource capacities?

What's your Ray version? (Just to see if we can use the new debugging tool.)

My suspicion is that the pending actor is waiting for a placement group that has not been created yet. I am not sure why you have so many pending placement groups, but 11 of them are pending, and if the actor requires one of those pending placement groups, it cannot be scheduled.
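
For context, a hedged minimal sketch of the behavior being described (the actor class and resource numbers are illustrative, not taken from the original script): on Ray 1.x, an actor placed into a placement group is not scheduled until every bundle of that group has been reserved, so it can stay pending even when the cluster nominally has spare resources.

    import ray
    from ray.util.placement_group import placement_group

    ray.init()

    @ray.remote(num_cpus=10, num_gpus=1)
    class Worker:  # illustrative stand-in for BaseHorovodWorker
        def ping(self):
            return "ok"

    # Two bundles, PACK strategy -- the same shape as one trial in this thread.
    pg = placement_group([{"CPU": 10, "GPU": 1}] * 2, strategy="PACK")

    # If the cluster cannot reserve *both* bundles, the group stays pending and
    # the actor below also stays PENDING, even if 10 CPUs + 1 GPU are free overall.
    worker = Worker.options(placement_group=pg).remote()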

The Ray version is 1.12.0.
We are using Ray Tune. Each trial is using [{'CPU': 10.0, 'GPU': 1.0}, {'CPU': 10.0, 'GPU': 1.0}] and there are 12 trials.
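
For reference, a rough back-of-the-envelope check with these numbers (an aside, not part of the original exchange) shows why most of the placement groups end up pending:

    # Rough arithmetic with the numbers reported above (illustrative only).
    cluster_cpus, cluster_gpus = 48, 4
    trial_cpus, trial_gpus = 2 * 10, 2 * 1      # two Horovod workers per trial
    num_trials = 12

    concurrent = min(cluster_cpus // trial_cpus, cluster_gpus // trial_gpus)
    print(concurrent)                # 2 -> only two trials' groups can be reserved at once
    print(num_trials - concurrent)   # ~10 groups left waiting, roughly matching the
                                     # 11 pending placement groups in the status above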

Is it possible to try Ray 2.0? That would make it much easier to troubleshoot this issue. If not, I can help troubleshoot in other ways, so let me know!


No, we cannot try Ray 2.0.

  • We requested 1 Ray head and 6 Ray workers (each with 10 CPUs and 1 GPU).
    Only 1 Ray head and 4 Ray workers were launched.
    The other 2 Ray workers are dead. → But this should not affect scheduling, since we still have enough resources in the Ray cluster.
cat /tmp/ray/session_latest/logs/raylet.out 
...
[2022-10-28 06:19:43,346 W 23 23] (raylet) agent_manager.cc:94: Agent process with pid 60 has not registered. ip , pid 0
[2022-10-28 06:19:43,351 W 23 61] (raylet) agent_manager.cc:104: Agent process with pid 60 exit, return value 0. ip . pid 0
[2022-10-28 06:19:43,351 E 23 61] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

@sangcho a friendly ping on this ticket.

Can you share the script you are running?

@xwjiang2010 The code base is pretty large, but this is the core logic:

    # Imports as implied by the usage below (Ray 1.12):
    from ray import tune
    from ray.tune.integration import horovod as horovod_dist

    def train_func(hp, checkpoint_dir=None):
        # ... model setup elided in the original post ...
        model.fit()

    trainable = horovod_dist.DistributedTrainableCreator(
        train_func,
        use_gpu=True,
        num_workers=2,
        num_cpus_per_worker=10,
    )

    analysis = tune.run(
        trainable,
        metric=objective_name,
        mode=objective_direction,
        num_samples=12,
        config=hp,
        resources_per_trial=resources_per_trial,
        search_alg="random",
        max_concurrent_trials=None,
    )

We still have our pod running. Is there any command we can run to figure out why the scheduling is blocked there? Is there a place to see the scheduling logs?

  • We are using KubeRay 0.1.0.

Is it that no trial is running? I wonder then why the resource usage says 10 CPUs and 1 GPU are used.

Each trial will use {'CPU': 10.0, 'GPU': 1.0} * 2 (PACK). The 2 means each trial has two Horovod workers.

So what happens is that for the first trial, only one Horovod worker {'CPU': 10.0, 'GPU': 1.0} is scheduled, and the other Horovod worker {'CPU': 10.0, 'GPU': 1.0} is not.
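
For readers following along, a hedged sketch (not from the original code) of what that per-trial demand corresponds to as a Tune placement group factory; later in the thread it turns out the Horovod integration also prepends an empty head bundle:

    from ray.tune import PlacementGroupFactory

    # One bundle per Horovod worker; PACK asks Ray to co-locate them if possible.
    per_trial_pg = PlacementGroupFactory(
        [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}],
        strategy="PACK",
    )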

Also double-check the doc here: Placement Groups — Ray 2.0.1

For PACK, all provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes. So this should not be an issue.
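
A hedged illustration of that distinction (the resource numbers are the ones from this thread; the call itself is plain Ray, not the Horovod integration):

    import ray
    from ray.util.placement_group import placement_group

    ray.init()

    bundles = [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}]

    # PACK: prefer a single node, but bundles may spill onto other nodes.
    pg_pack = placement_group(bundles, strategy="PACK")

    # STRICT_PACK: all bundles must fit on one node, or the group stays pending.
    pg_strict = placement_group(bundles, strategy="STRICT_PACK")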

Yes, your understanding is correct.

What puzzles me is how a partial placement group can be granted (i.e., for one Horovod worker) when Tune always requests the whole trial (the equivalent of two workers) at a time.

@sangcho any ideas?


When the actor is scheduled, it requires both its resources and its placement group. So it is highly likely that the placement group the actor requires is not ready (you have 11 pending placement groups, and one of them is probably the one the pending actor needs). If you look at your usage, 40 of the 48 CPUs (and all 4 GPUs) are already reserved in placement groups, so the remaining 8 CPUs cannot fit another {'CPU': 10, 'GPU': 1} * 2 group, and that's why you cannot schedule an additional placement group anymore.

I am not exactly sure how Horovod creates placement groups and actors. @xwjiang2010 can you tell me in a bit more detail how this works here? For example, does each Horovod worker use one placement group? How do you specify the resource usage?

So I just tried this on 1.12.0:

    horovod_trainable = DistributedTrainableCreator(
        train,
        num_workers=2,
        num_cpus_per_worker=1,
    )

This results in a placement group request of [{}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}].

Is each pg used by a BaseHorovodWorker? How do you specify the GPU usage?

The GPU setting is passed to DistributedTrainableCreator: there is a use_gpu field in DistributedTrainableCreator, and if it is set to True, the GPU count per worker will be 1.
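
Putting the two answers together, a hedged extrapolation (not verified output) of what the per-trial request would look like with the settings from the original report:

    from ray.tune.integration.horovod import DistributedTrainableCreator

    def train(config):
        ...  # placeholder training function, standing in for the one above

    # With use_gpu=True, num_workers=2 and num_cpus_per_worker=10, the resulting
    # placement group request would presumably be
    #   [{}, {'CPU': 10, 'GPU': 1}, {'CPU': 10, 'GPU': 1}]
    # i.e. an empty head bundle plus one {CPU: 10, GPU: 1} bundle per worker.
    horovod_trainable = DistributedTrainableCreator(
        train,
        use_gpu=True,
        num_workers=2,
        num_cpus_per_worker=10,
    )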

Just updating the thread so that it may be helpful as a reference.

So basically what happened is:

  1. Each trial is taking [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}]. Initially, two trials request their corresponding placement groups.
  2. Trial 1's placement group is ready.
  3. Tune starts trial 1. In the meantime, one bundle inside the placement group dies.
  4. As a result, trial 1's start hangs at: horovod/strategy.py at v0.23.0 · horovod/horovod · GitHub.
    Only a partial placement group appears to be in use (see the sketch below for one way to inspect placement group state).
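
For future readers hitting the same symptom, a hedged sketch of one way to inspect placement group state from the driver on Ray 1.x (the exact keys in the returned table may vary across versions):

    import ray
    from ray.util import placement_group_table

    ray.init(address="auto")

    # Dump every placement group the cluster knows about, with its state
    # (typically PENDING / CREATED / REMOVED) and its bundles.
    for pg_id, info in placement_group_table().items():
        print(pg_id, info.get("state"), info.get("bundles"))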