1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.48.0
- Python version: 3.11.11
- OS: Ubuntu 24.04.3 LTS with 127 cores
- GPU Infrastructure:
  - 4 NVIDIA GPUs
  - Driver Version: 535.230.02
  - CUDA Version: 12.2
3. Context
Hi there! Thanks for the awesome package.
I am migrating from `Parallel` to Ray because Ray makes it possible to share an object across different processes.
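For context, this is the kind of sharing pattern I mean (a minimal sketch, not my real code; the object and names here are only illustrative):

```python
import ray

ray.init()

# The object is put into the shared object store once and can then be read by
# many tasks/processes without reloading or copying it per worker.
shared_obj = ray.put({"forecast_horizon": 24})  # illustrative object


@ray.remote
def use_shared(obj, i):
    # Ray resolves the ObjectRef before calling the task, so `obj` is the dict.
    return i, obj["forecast_horizon"]


print(ray.get([use_shared.remote(shared_obj, i) for i in range(4)]))
```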
4. Problem
I need to load two pre-trained foundation models (deep learning):
```python
num_cpus_actor = .0125
num_gpus_actor = .0115


@ray.remote(num_gpus=num_gpus_actor, num_cpus=num_cpus_actor)
class ChronosModelActor:
    def __init__(self, model_path: str, device: str, torch_dtype):
        self.model = load()  # loads the pre-trained model

    def predict(self, context_tensor, forecast_horizon: int):
        mean = self.model.predict()
        return mean


@ray.remote(num_gpus=num_gpus_actor, num_cpus=num_cpus_actor)
class TimeMoEModelActor:
    def __init__(self, model_path: str, device: str, torch_dtype):
        self.model = load()  # loads the pre-trained model

    def predict(self, context_tensor, forecast_horizon: int):
        return self.model.predict()


# Actual constructor argument values omitted here.
chronos_actor = ChronosModelActor.remote(model_path, device, torch_dtype)
timemoe_actor = TimeMoEModelActor.remote(model_path, device, torch_dtype)
```
And I use them across different CPU cores, while the prediction itself runs on the GPU:
```python
@ray_debug
@ray.remote(num_cpus=.95, num_gpus=0)
def process_series(
    ray_actors,
    ...
):
    ...  # task body omitted here


N_CORES = 40
ray.init(num_cpus=N_CORES, num_gpus=1, include_dashboard=False, local_mode=False)

futures = []
for directory in directories:
    futures.append(
        process_series.remote(
            ray_actors={"chronos": chronos_actor, "timemoe": timemoe_actor},
            ...  # other arguments
        )
    )

results = ray.get(futures)
```
So, in this case I set it up to process 40 directories in parallel. Each run uses both actors (each reserving `num_cpus_actor` CPUs and `num_gpus_actor` GPUs).
In my math, each process would have the following total usage:
- CPU per process: .95 (task function) + .0125 * 2 (small CPU usage of the two actors) = .975
- GPU per process: .0115 * 2 = .023

All processes use the same GPU, and the memory each model needs to run `predict` is very low.
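As a sanity check, this is how I account for the whole node (the numbers just restate the values above; `ray.cluster_resources()` and `ray.available_resources()` are the standard Ray calls for inspecting this):

```python
import ray

N_CORES = 40
NUM_CPUS_TASK = .95
NUM_CPUS_ACTOR, NUM_GPUS_ACTOR = .0125, .0115

# Expected demand if all 40 tasks run at once plus the two model actors.
expected_cpu = N_CORES * NUM_CPUS_TASK + 2 * NUM_CPUS_ACTOR  # 38.025 <= 40
expected_gpu = 2 * NUM_GPUS_ACTOR                            # 0.023  <= 1
print(f"expected CPU demand: {expected_cpu}, expected GPU demand: {expected_gpu}")

if not ray.is_initialized():
    ray.init(num_cpus=N_CORES, num_gpus=1)

# What the cluster declares vs. what is currently unclaimed.
print(ray.cluster_resources())
print(ray.available_resources())
```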
With all of this in mind, I get the following messages:

```
(autoscaler +1m9s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +1m9s) Warning: The following resource request cannot be scheduled right now: {'CPU': 0.95}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
```
I have tried changing the CPU/GPU parameter values, with no success.
Questions:
- How do I fix the autoscaler warnings by setting the correct number of CPUs/GPUs/nodes?
- Even though Ray could not schedule the task right away, does it put it into a queue? (It seems that all directories were processed anyway.)