Request cannot be scheduled if the number of actors is larger than the number of GPUs

I ran the code from “Batch inference with Ray Actors”.

My machine has 4 GPUs, and I configured the actors like this:

@ray.remote(num_gpus=1, num_cpus=1)
class PredictionActor:

I tried to create 8 actors and started the inference. The program hung there and printed this:

Since it’s only a warning, I think the system shouldn’t hang.

Is this normal, or is it a bug?

OS: CentOS 7.9
Ray: 2.3
Python: 3.9
128 CPUs
4 GPUs with 24G mem
132G mem

Yes, this is expected, since there are not enough GPU resources to schedule all 8 actors. Only the first 4 actors can be scheduled; the remaining 4 stay in the pending state.
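A minimal sketch of that arithmetic, in case it helps to check it before launching the job. The helper names `split_running_pending` and `cluster_gpu_count` are made up for illustration and are not part of Ray; `ray.cluster_resources()` is the real call that reports the cluster's logical resources. The sketch assumes every actor requests the same number of GPUs.

```python
def split_running_pending(total_gpus, gpus_per_actor, requested_actors):
    """Return (running, pending) actor counts for a given logical GPU budget.

    Hypothetical helper: assumes every actor requests `gpus_per_actor` GPUs
    and that nothing else in the cluster is holding GPUs.
    """
    capacity = int(total_gpus // gpus_per_actor)
    running = min(requested_actors, capacity)
    return running, requested_actors - running


def cluster_gpu_count():
    """Ask Ray how many logical GPUs the cluster has (0 if none)."""
    import ray  # imported lazily so the helper above works without Ray

    if not ray.is_initialized():
        ray.init()
    return ray.cluster_resources().get("GPU", 0)
```

With 4 GPUs, `num_gpus=1` per actor, and 8 requested actors, `split_running_pending(4, 1, 8)` gives 4 running and 4 pending, which matches what was observed here.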

It is fine for the remaining 4 to be in the pending state. But those pending actors seem to be stuck there forever. My understanding was that they would start running as soon as the other 4 actors finished.

From the screenshot, you can see that 7 minutes elapsed before I killed the program.

An actor only finishes when there are no more references to it. Are you still holding references to the first 4 actors?

All I changed was to tell Ray to use GPUs and to increase the number of actors to 8. But you are right, the code given in the example seems buggy. I will double-check it.

Thanks a lot

I checked the example code again. It’s hard to call it a bug; I believe the author just did not give this case much thought. If the number of actors is larger than what the system’s resources can support, the program will hang forever.

I would like to dive deeper. Please help me clarify.

Suppose I have 4 GPUs available now:

  1. I submit 8 actors, but only 4 actors can be created and become active, while the other 4 stay pending.
  2. The 4 active actors will occupy the GPUs until no one holds references to them or they are killed explicitly with ray.kill().
  3. Even if the 4 active actors do nothing and just sit idle, the other 4 pending actors have no chance to be activated and run.

Simply put, as long as an actor is alive, the resources it occupies won’t be released. Am I right? Is this designed intentionally?
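One way to work within that behaviour is to keep no more actors alive than there are GPUs, and to kill each group before creating the next. The following is only a sketch under assumptions: `batches` and `run_in_waves` are hypothetical names, and the `predict` body is a placeholder rather than the code from the example. `ray.kill()` and `ray.get()` are the real Ray calls.

```python
def batches(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_waves(shards, num_gpus_available):
    """Process shards with at most `num_gpus_available` live actors at a time."""
    import ray  # imported lazily so `batches` works without Ray

    @ray.remote(num_gpus=1, num_cpus=1)
    class PredictionActor:
        def predict(self, shard):
            # Placeholder for real model inference.
            return len(shard)

    results = []
    for wave in batches(shards, num_gpus_available):
        actors = [PredictionActor.remote() for _ in wave]
        refs = [a.predict.remote(s) for a, s in zip(actors, wave)]
        results.extend(ray.get(refs))  # wait for this wave to finish
        for a in actors:
            ray.kill(a)  # release the GPU reserved by this actor
    return results
```

With 8 shards and 4 GPUs this runs two waves of 4 actors each. Reusing the same 4 actors for several shards would avoid reloading the model and is probably the better design for real inference; the waves are only meant to show when the reservation is released.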

Thanks for your time.

Yes, your understanding is correct. It’s designed intentionally. When you specify num_gpus=1, you are saying: for the lifetime of this actor, reserve 1 GPU for it, regardless of the actual physical usage.

Note that Ray resources are logical: Resources — Ray 2.3.0
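Because the reservation is logical, a fractional request such as num_gpus=0.5 lets two actors share one GPU, so 8 actors could all be scheduled on 4 GPUs. Here is a rough sketch; `gpu_share_per_actor` and `make_actors` are hypothetical helpers, while `.options(num_gpus=...)` is the real Ray call for overriding an actor's resource request. It assumes each model fits in its share of GPU memory, which Ray does not enforce.

```python
import math


def gpu_share_per_actor(total_gpus, num_actors):
    """Return the logical GPU fraction each actor should request.

    Hypothetical helper: packs ceil(num_actors / total_gpus) actors on each
    GPU and truncates the share to 4 decimal places, a conservative choice
    so that the shares on one GPU never add up to more than 1.
    """
    actors_per_gpu = max(1, math.ceil(num_actors / total_gpus))
    return math.floor(10000 / actors_per_gpu) / 10000


def make_actors(actor_cls, total_gpus, num_actors):
    """Create `num_actors` actors from a @ray.remote class with shared GPUs."""
    share = gpu_share_per_actor(total_gpus, num_actors)
    return [actor_cls.options(num_gpus=share).remote() for _ in range(num_actors)]
```

For the setup in this thread, `gpu_share_per_actor(4, 8)` is 0.5, so each 24G GPU would be shared by two actors; whether two copies of the model actually fit in that memory is up to the user to check.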