Ray actors cannot be scheduled due to resource constraints

We are running Ray Tune where each trial runs with 2 Horovod workers.

On the Ray dashboard, there is an error message: BaseHorovodWorker cannot be created because the Ray cluster cannot satisfy its resource requirements.

When I run ray status, this is the output.

======== Autoscaler status: 2022-10-28 21:33:23.829031 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_a4ac8c7f866699e1bc8288994fd1d61b876b4be93c67f4d9a02c4f1b
 1 node_2d8e044893addc45eba880ba0dfb495bda47b767506fd3b59d405976
 1 node_03ce0fcc29a268b0152701ef3e9c95c1c382dfb4a02e59bbd1de0b53
 1 node_9563ded98c6c1caf78a7cd20082682c013cea2bea73e38a4d5d39562
 1 node_5d22539387a4b1b77915de793251effc732ff686ca3a2dc704a1690a
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 10.0/48.0 CPU (10.0 used of 40.0 reserved in placement groups)
 1.0/4.0 GPU (1.0 used of 4.0 reserved in placement groups)
 0.0/4.0 accelerator_type:V100
 0.00/136.000 GiB memory
 0.00/37.717 GiB object_store_memory

Demands:
 {'CPU': 10.0, 'GPU': 1.0}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 10.0, 'GPU': 1.0} * 2 (PACK): 11+ pending placement groups

From this output, the cluster has enough resources for {'CPU': 10.0, 'GPU': 1.0}; however, the actor is still pending and cannot be scheduled.

We are using horovod.DistributedTrainableCreator to construct the trainable.

Thanks.

@sangcho @Alex do you remember how we can get Ray node resource capacities?

What's your Ray version? (Just to see if we can use the new debugging tool.)

My suspicion is that the pending actor is waiting for a placement group that has not been created yet. I am not sure why you have so many pending placement groups, but 11 of them are pending, and if the actor requires one of those pending placement groups, it cannot be scheduled.
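
For context, a hedged minimal sketch of the behavior being described (the actor class and resource numbers are illustrative, not taken from the original script): on Ray 1.x, an actor placed into a placement group is not scheduled until every bundle of that group has been reserved, so it can stay pending even when the cluster nominally has spare resources.

    import ray
    from ray.util.placement_group import placement_group

    ray.init()

    @ray.remote(num_cpus=10, num_gpus=1)
    class Worker:  # illustrative stand-in for BaseHorovodWorker
        def ping(self):
            return "ok"

    # Two bundles, PACK strategy -- the same shape as one trial in this thread.
    pg = placement_group([{"CPU": 10, "GPU": 1}] * 2, strategy="PACK")

    # If the cluster cannot reserve *both* bundles, the group stays pending and
    # the actor below also stays PENDING, even if 10 CPUs + 1 GPU are free overall.
    worker = Worker.options(placement_group=pg).remote()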

The Ray version is 1.12.0.
We are using Ray Tune. Each trial is using [{'CPU': 10.0, 'GPU': 1.0}, {'CPU': 10.0, 'GPU': 1.0}] and there are 12 trials.
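
For reference, a rough back-of-the-envelope check with these numbers (an aside, not part of the original exchange) shows why most of the placement groups end up pending:

    # Rough arithmetic with the numbers reported above (illustrative only).
    cluster_cpus, cluster_gpus = 48, 4
    trial_cpus, trial_gpus = 2 * 10, 2 * 1      # two Horovod workers per trial
    num_trials = 12

    concurrent = min(cluster_cpus // trial_cpus, cluster_gpus // trial_gpus)
    print(concurrent)                # 2 -> only two trials' groups can be reserved at once
    print(num_trials - concurrent)   # ~10 groups left waiting, roughly matching the
                                     # 11 pending placement groups in the status above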

Is it possible to try Ray 2.0? That would make it much easier to troubleshoot this issue. If not, I can help troubleshoot in other ways, so let me know!


No, we cannot try Ray 2.0.

  • We requested 1 Ray head and 6 Ray workers (each with 10 CPUs and 1 GPU).
    Only 1 Ray head and 4 Ray workers were launched.
    The other 2 Ray workers are dead. → But this should not affect scheduling, since we still have enough resources in the Ray cluster.
cat /tmp/ray/session_latest/logs/raylet.out 
...
[2022-10-28 06:19:43,346 W 23 23] (raylet) agent_manager.cc:94: Agent process with pid 60 has not registered. ip , pid 0
[2022-10-28 06:19:43,351 W 23 61] (raylet) agent_manager.cc:104: Agent process with pid 60 exit, return value 0. ip . pid 0
[2022-10-28 06:19:43,351 E 23 61] (raylet) agent_manager.cc:107: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

@sangcho a friendly ping on this ticket.

Can you share the script you are running?

@xwjiang2010 The code base is pretty large, but this is the core logic:

    # Imports as implied by the usage below (Ray 1.12):
    from ray import tune
    from ray.tune.integration import horovod as horovod_dist

    def train_func(hp, checkpoint_dir=None):
        # ... model setup elided in the original post ...
        model.fit()

    trainable = horovod_dist.DistributedTrainableCreator(
        train_func,
        use_gpu=True,
        num_workers=2,
        num_cpus_per_worker=10,
    )

    analysis = tune.run(
        trainable,
        metric=objective_name,
        mode=objective_direction,
        num_samples=12,
        config=hp,
        resources_per_trial=resources_per_trial,
        search_alg="random",
        max_concurrent_trials=None,
    )

We still have our pod running. Is there any command we can run to figure out why the scheduling is blocked there? Is there a place to see the scheduling logs?

  • We are using KubeRay 0.1.0.

Is it that no trial is running? I wonder then why the resource usage says 10 CPUs and 1 GPU are used.

Each trial will use {'CPU': 10.0, 'GPU': 1.0} * 2 (PACK). The 2 means each trial has two Horovod workers.

So what happens is that for the first trial, only one Horovod worker {'CPU': 10.0, 'GPU': 1.0} is scheduled, and the other Horovod worker {'CPU': 10.0, 'GPU': 1.0} is not.
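
For readers following along, a hedged sketch (not from the original code) of what that per-trial demand corresponds to as a Tune placement group factory; later in the thread it turns out the Horovod integration also prepends an empty head bundle:

    from ray.tune import PlacementGroupFactory

    # One bundle per Horovod worker; PACK asks Ray to co-locate them if possible.
    per_trial_pg = PlacementGroupFactory(
        [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}],
        strategy="PACK",
    )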

Also double-check the doc here: Placement Groups — Ray 2.0.1

For PACK, all provided bundles are packed onto a single node on a best-effort basis. If strict packing is not feasible (i.e., some bundles do not fit on the node), bundles can be placed onto other nodes. So this should not be an issue.
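
A hedged illustration of that distinction (the resource numbers are the ones from this thread; the call itself is plain Ray, not the Horovod integration):

    import ray
    from ray.util.placement_group import placement_group

    ray.init()

    bundles = [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}]

    # PACK: prefer a single node, but bundles may spill onto other nodes.
    pg_pack = placement_group(bundles, strategy="PACK")

    # STRICT_PACK: all bundles must fit on one node, or the group stays pending.
    pg_strict = placement_group(bundles, strategy="STRICT_PACK")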

Yes, your understanding is correct.

What puzzles me is how a partial placement group can be granted (i.e., for one Horovod worker) when Tune always requests the whole trial (the equivalent of two workers) at a time.

@sangcho any ideas?


When the actor is scheduled, it requires both its resources and its placement group. So it is highly likely that the placement group the actor requires is not ready (you have 11 pending placement groups, and one of them is probably the one the pending actor needs). If you look at your usage, 40 of the 48 CPUs (and all 4 GPUs) are already reserved in placement groups, so the remaining 8 CPUs cannot fit another {'CPU': 10, 'GPU': 1} * 2 group, and that's why you cannot schedule an additional placement group anymore.

I am not exactly sure how Horovod creates placement groups and actors. @xwjiang2010 can you tell me in a bit more detail how this works here? For example, does each Horovod worker use one placement group? How do you specify the resource usage?

So I just tried this on 1.12.0:

    horovod_trainable = DistributedTrainableCreator(
        train,
        num_workers=2,
        num_cpus_per_worker=1,
    )

This results in a placement group request of [{}, {'CPU': 1, 'GPU': 0}, {'CPU': 1, 'GPU': 0}].

Is each pg used by a BaseHorovodWorker? How do you specify the GPU usage?

The GPU setting is passed to DistributedTrainableCreator: there is a use_gpu field in DistributedTrainableCreator, and if it is set to True, the GPU count per worker will be 1.
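
Putting the two answers together, a hedged extrapolation (not verified output) of what the per-trial request would look like with the settings from the original report:

    from ray.tune.integration.horovod import DistributedTrainableCreator

    def train(config):
        ...  # placeholder training function, standing in for the one above

    # With use_gpu=True, num_workers=2 and num_cpus_per_worker=10, the resulting
    # placement group request would presumably be
    #   [{}, {'CPU': 10, 'GPU': 1}, {'CPU': 10, 'GPU': 1}]
    # i.e. an empty head bundle plus one {CPU: 10, GPU: 1} bundle per worker.
    horovod_trainable = DistributedTrainableCreator(
        train,
        use_gpu=True,
        num_workers=2,
        num_cpus_per_worker=10,
    )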

Just updating the thread so that it may be helpful as a reference.

So basically what happened is:

  1. Each trial is taking [{"CPU": 10, "GPU": 1}, {"CPU": 10, "GPU": 1}]. Initially, two trials request their corresponding placement groups.
  2. Trial 1's placement group is ready.
  3. Tune starts trial 1. In the meantime, one bundle inside the placement group dies.
  4. As a result, trial 1's start hangs at: horovod/strategy.py at v0.23.0 · horovod/horovod · GitHub.
    Only a partial placement group appears to be in use (see the sketch below for one way to inspect placement group state).
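
For future readers hitting the same symptom, a hedged sketch of one way to inspect placement group state from the driver on Ray 1.x (the exact keys in the returned table may vary across versions):

    import ray
    from ray.util import placement_group_table

    ray.init(address="auto")

    # Dump every placement group the cluster knows about, with its state
    # (typically PENDING / CREATED / REMOVED) and its bundles.
    for pg_id, info in placement_group_table().items():
        print(pg_id, info.get("state"), info.get("bundles"))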