What is the Math for allocating GPU and CPU resources?

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.48.0
  • Python version: 3.11.11
  • OS: Ubuntu 24.04.3 LTS with 127 cores
  • GPU Infrastructure:
    • 4 NVIDIA GPUs (NVIDIA-SMI 535.230.02)
    • Driver Version: 535.230.02
    • CUDA Version: 12.2

3. Context
Hi there! Thanks for the awesome package.
I am migrating from Parallel to Ray because Ray makes it possible to share an object across different processes.

4. Problem
I need to load 2 pre-trained foundation models (deep learning):

num_cpus_actor = .0125
num_gpus_actor = .0115
@ray.remote(num_gpus=num_gpus_actor, num_cpus=num_cpus_actor)
class ChronosModelActor:
    def __init__(self, model_path: str, device: str, torch_dtype):
        self.model = load() # it loads the model

    def predict(self, context_tensor, forecast_horizon: int):
        mean = self.model.predict()
        return mean

@ray.remote(num_gpus=num_gpus_actor, num_cpus=num_cpus_actor)
class TimeMoEModelActor:
    def __init__(self, model_path: str, device: str, torch_dtype):
        self.model = load() # it loads the model

    def predict(self, context_tensor, forecast_horizon: int):
        return self.model.predict()        

# Constructor arguments (model_path, device, torch_dtype) omitted here for brevity
chronos_actor = ChronosModelActor.remote()
timemoe_actor = TimeMoEModelActor.remote()

And I use them across different CPU cores, while the prediction itself runs on the GPU:

@ray_debug
@ray.remote(num_cpus=.95, num_gpus=0)
def process_series(
    ray_actors,
    ...
):

N_CORES = 40
ray.init(num_cpus=N_CORES, num_gpus=1, include_dashboard=False, local_mode=False)
futures = []
for directory in directories:
    futures.append(
        process_series.remote(
            ray_actors={"chronos": chronos_actor, "timemoe": timemoe_actor},
            ... # Other arguments
        )
    )
results = ray.get(futures)  # gather outside the loop so all tasks run in parallel

So, in this case I set it up to run 40 directories in parallel. In each run, I use both actors (each consuming num_cpus_actor and num_gpus_actor).

In my math, each process would have the following total usage:

  • CPU for each process: .95 (task function) + .0125*2 (small actor CPU usage) = .975.
  • GPU for each process: .0125*2 = .025. All cores use the same GPU, and the memory each model needs to run predict is very low.
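A quick sanity check of that arithmetic in plain Python (values copied from the snippets above; note that the snippet actually sets num_gpus_actor = .0115, while the math here uses .0125):

```python
# Per-request resource values from the post (num_gpus_actor is taken as
# 0.0125 to match the arithmetic above; the snippet shows 0.0115).
num_cpus_task = 0.95     # process_series task
num_cpus_actor = 0.0125  # each model actor
num_gpus_actor = 0.0125

# "Per process" totals as computed above: one task plus calls into both actors.
cpu_per_process = num_cpus_task + 2 * num_cpus_actor
gpu_per_process = 2 * num_gpus_actor

print(round(cpu_per_process, 4))  # 0.975
print(round(gpu_per_process, 4))  # 0.025
```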

With all of this in mind I get the following message:

(autoscaler +1m9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m9s) Warning: The following resource request cannot be scheduled right now: {'CPU': 0.95}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

I have tried to change the parameter values of CPU/GPU with no success.

Questions:

  1. How do I fix the autoscaler messages by setting the correct number of CPUs/GPUs/nodes?
  2. Even though Ray could not allocate the task, does it put it into a queue? (It seems that all directories were running.)

Hey, so the issue here is that fractional GPU resources don’t really work today. See [Core] Ray fractional GPU unable to be scheduled · Issue #52133 · ray-project/ray · GitHub. We’re planning on fixing this very soon, so it shouldn’t be long until you can use them! In the meantime, you can try the workaround mentioned in the issue.

Thank you! But in that case, they are using two nodes and receive an error, while I am using a single node and don’t receive an error (just an annoying warning).

If I start 40 processes at the same time, I would have this total GPU usage: 40*(.0125*2) = 1, so it fits exactly. Even when I tried 40*(.0115*2) = .92, or 40*(.006*2) = .48, I got the same message:
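For comparison, here is a plain-Python sketch of the cluster-wide accounting under the assumption (which I believe matches Ray's model) that an actor reserves its resources once, at creation, for its lifetime, rather than once per task that calls it:

```python
# Sketch, assuming actor resources are reserved once per actor at creation
# and task resources once per concurrently running task.
num_actors = 2   # ChronosModelActor + TimeMoEModelActor
num_tasks = 40   # one process_series task per directory

actor_cpu, actor_gpu = 0.0125, 0.0125
task_cpu, task_gpu = 0.95, 0.0

total_cpu = num_actors * actor_cpu + num_tasks * task_cpu
total_gpu = num_actors * actor_gpu + num_tasks * task_gpu

print(round(total_cpu, 4))  # 38.025 requested vs num_cpus=40 in ray.init
print(round(total_gpu, 4))  # 0.025 requested vs num_gpus=1
```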

(autoscaler +49s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +49s) Warning: The following resource request cannot be scheduled right now: {'CPU': 0.9}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

Maybe my problem is more related to CPU than GPU. Even when I tried to use only 4 cores, I received the same message.
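With only 4 cores, the same accounting suggests very few tasks fit at once, and the warning text ("cannot be scheduled right now") indicates the remaining tasks simply wait until resources free up. A rough sketch, under the same assumption that the two actors hold their CPU share for their lifetime:

```python
import math

# With ray.init(num_cpus=4): the 2 actors hold 0.0125 CPU each for their
# lifetime, and each process_series task needs 0.95 CPU while running.
available_cpu = 4 - 2 * 0.0125                  # 3.975 CPUs left for tasks
concurrent_tasks = math.floor(available_cpu / 0.95)

print(concurrent_tasks)  # 4 tasks run at a time; the other 36 wait in queue
```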