Minimizing model loading/offloading

Hi,

I’m relatively new to Ray. Very impressive library!

I’m exploring how to implement the following scenario while minimizing model loading/offloading:

  • There are k tasks of type A and k tasks of type B, where each type B task depends on one type A task.

  • Type A tasks use model M1, and type B tasks use model M2.

From what I understand, using Actors is one way to avoid reloading models for each task. However, if the cluster lacks resources to run both Actors simultaneously, is there a way to configure it so that:

  1. Actor A1 (for model M1) is created.

  2. All type A tasks are executed.

  3. Actor A1 is terminated.

  4. Actor A2 (for model M2) is created.

  5. All type B tasks are executed.

Ideally, if resources are available, both models should run in parallel to maximize efficiency.

In summary, I’m looking for a way to define tasks independently of their execution strategy and let Ray automatically schedule the most efficient execution plan, i.e. the one that minimizes model loading/offloading.

Any insights or best practices would be greatly appreciated!

It seems likely that Ray does not provide a built-in way to automatically serialize the creation and destruction of actors based on resource availability, nor to automatically switch between serial and parallel execution of actors depending on cluster resources. You can, however, implement this logic manually: create Actor A1 (with model M1), run all type A tasks, terminate A1, then create Actor A2 (with model M2), and run all type B tasks. If resources allow, you can instantiate both actors and run tasks in parallel. Ray will queue actor/task creation if resources are insufficient, but it will not automatically serialize actor lifecycles for you or minimize model loads/offloads without explicit orchestration in your code (see this discussion, Ray Actors docs).

For best practices, you might consider writing orchestration logic that checks available resources (e.g., via ray.available_resources()), then decides whether to run both actors in parallel or to run them sequentially. This approach gives you control over model loading/offloading and ensures efficient resource usage, but it is not handled automatically by Ray (see this example, Ray resource scheduling). Would you like more detail or a code example for this orchestration pattern?
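As a rough illustration of the resource check described above, here is a minimal sketch of the decision step on its own. The helper `choose_strategy` and the per-actor requirement dictionaries are hypothetical names introduced for this example, not part of Ray; the only assumption about Ray is that `ray.available_resources()` returns a dictionary mapping resource names to amounts, which is the shape `available` takes here.

```python
def choose_strategy(available, per_actor):
    """Return "parallel" if all actors fit at the same time, else "serial".

    available: dict of resource name -> amount, e.g. the result of
               ray.available_resources().
    per_actor: hypothetical dict of actor name -> its resource requirements.
    """
    # Sum the requirements of all actors, resource by resource.
    needed = {}
    for requirements in per_actor.values():
        for name, amount in requirements.items():
            needed[name] = needed.get(name, 0) + amount

    # Both actors can coexist only if every summed requirement is available.
    fits = all(available.get(name, 0) >= amount for name, amount in needed.items())
    return "parallel" if fits else "serial"


# Assumed requirements: each model actor needs one GPU.
requirements = {"A1": {"GPU": 1}, "A2": {"GPU": 1}}

print(choose_strategy({"CPU": 8, "GPU": 1}, requirements))  # one GPU: run serially
print(choose_strategy({"CPU": 8, "GPU": 2}, requirements))  # two GPUs: run in parallel
```

Note that this compares cluster-wide totals only. It does not check whether each actor fits on a single node, and the available amounts can change between the check and the actor creation, so treat the result as a heuristic.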

Hint: Mention @RunLLM in the post for followups.

@RunLLM, is there a way to inspect the DAG of tasks and their required resources, and then adjust the scheduling according to the available resources and the constraint of minimizing model loading/offloading?

My understanding is that Ray does not provide a built-in API to inspect the full DAG of tasks and actors, analyze their resource requirements, and then automatically adjust scheduling to minimize model loading/offloading. Ray’s scheduler considers resource requirements and dependencies at scheduling time, but it does not expose a high-level DAG inspection or optimization interface for this purpose (Ray scheduling docs). You can, however, manually construct your own DAG (e.g., using Ray DAG or workflow libraries), inspect it, and implement custom orchestration logic to schedule actors/tasks in a way that minimizes model loads (Ray Core docs).

If you want to optimize for minimal model loading/offloading, you would need to build this logic yourself: analyze your workflow, group tasks by model, and schedule them in batches according to available resources. Ray will not automatically reorder or batch tasks to minimize model swaps. Would you like more detail or a code example for building such a custom scheduler?
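The "group tasks by model, then schedule in batches" idea can be sketched in plain Python, independently of Ray. The task table format and the `plan_phases` helper below are assumptions made for this illustration; the sketch groups tasks by the model they use and orders the model phases so that every model runs after the models its tasks depend on, which gives one load per model.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_phases(tasks):
    """Group tasks by model and order the groups by their dependencies.

    tasks: hypothetical dict of task_id -> (model_name, [dependency task_ids]).
    Returns a list of (model_name, [task_ids]) phases, one phase per model.
    """
    # Collect the tasks that share a model.
    by_model = defaultdict(list)
    for task_id, (model, _deps) in tasks.items():
        by_model[model].append(task_id)

    # Lift task-level dependencies to model-level dependencies.
    model_deps = {model: set() for model in by_model}
    for task_id, (model, deps) in tasks.items():
        for dep in deps:
            dep_model = tasks[dep][0]
            if dep_model != model:
                model_deps[model].add(dep_model)

    # Models with no unmet dependencies come first. A CycleError here means
    # the workflow cannot be run with a single load per model.
    order = TopologicalSorter(model_deps).static_order()
    return [(model, by_model[model]) for model in order]


# The scenario from the question with k = 3: each B task depends on one A task.
tasks = {
    "a0": ("M1", []), "b0": ("M2", ["a0"]),
    "a1": ("M1", []), "b1": ("M2", ["a1"]),
    "a2": ("M1", []), "b2": ("M2", ["a2"]),
}

for model, task_ids in plan_phases(tasks):
    print(model, task_ids)
```

Each phase can then be mapped to one actor: create it, submit the phase's tasks, collect the results, and terminate it before the next phase starts. The sketch keeps tasks within a phase in their input order and does not sort them by dependencies inside the same model.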

@RunLLM, please provide a code example.

Certainly! Here’s a simple example of manual orchestration: you check available resources, then decide whether to run both actors (models) in parallel or serialize them to minimize model loading/offloading. This pattern is not automatic in Ray; you must implement the logic yourself (see this example, Ray resource scheduling).

import time

import ray

ray.init()

# Reserve one CPU per actor so that the CPU check below reflects what the
# actors actually hold. By default an actor needs 1 CPU to be scheduled but
# holds 0 CPUs while running. For real models you would more likely reserve
# GPUs (num_gpus) or a custom resource instead.
@ray.remote(num_cpus=1)
class ModelActor:
    def __init__(self, model_name):
        print(f"Loading model {model_name}")
        self.model_name = model_name
        time.sleep(2)  # Simulate model loading

    def run_task(self, x):
        return f"{self.model_name} processed {x}"

def run_tasks_serial():
    # Only one model is resident at a time.
    # Run all A tasks with model M1, then free the actor.
    actor_a = ModelActor.remote("M1")
    results_a = ray.get([actor_a.run_task.remote(i) for i in range(5)])
    ray.kill(actor_a)
    # Run all B tasks with model M2, then free the actor.
    actor_b = ModelActor.remote("M2")
    results_b = ray.get([actor_b.run_task.remote(i) for i in range(5, 10)])
    ray.kill(actor_b)
    return results_a + results_b

def run_tasks_parallel():
    # Both models are resident at the same time.
    actor_a = ModelActor.remote("M1")
    actor_b = ModelActor.remote("M2")
    # Submit everything first, then wait once for all results.
    refs_a = [actor_a.run_task.remote(i) for i in range(5)]
    refs_b = [actor_b.run_task.remote(i) for i in range(5, 10)]
    results = ray.get(refs_a + refs_b)
    ray.kill(actor_a)
    ray.kill(actor_b)
    return results

# Example: check available resources and choose a strategy.
# Two free CPUs are needed to hold both actors at once.
resources = ray.available_resources()
if resources.get("CPU", 0) >= 2:
    print("Running in parallel")
    results = run_tasks_parallel()
else:
    print("Running serially")
    results = run_tasks_serial()

print(results)

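This code checks the available CPUs and chooses whether to run both actors in parallel or one after the other. In both cases each model is loaded exactly once; the serial path simply keeps only one model resident at a time when resources are limited.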
