[Research] Leveraging Metaheuristics (GA+VNS) for Optimal Task Placement in Heterogeneous Ray Clusters

1. Severity of the issue: (select one)
None: I’m just curious or want clarification.

2. Environment:

  • Ray version: N/A (Looking into potential integration)
  • Python version: 3.10+
  • OS: Linux / Darwin
  • Cloud/Infrastructure: Highly Heterogeneous Clusters (e.g., Mixed ARM/x86, varying GPU generations)

3. What happened vs. what you expected:

  • Expected: A scheduling mechanism that can achieve near-global optimal task-to-node mapping in clusters where hardware performance is “unrelated” (i.e., task performance depends heavily on specific architecture/instruction sets).
  • Actual: Most default schedulers rely on local heuristics (such as least-requested or round-robin), which, while fast, often produce suboptimal makespan and wasted energy in complex heterogeneous scenarios.
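To make the gap concrete, here is a minimal, self-contained toy (all values are made up for illustration) where a least-loaded heuristic badly misplaces tasks on unrelated machines. The ETC (expected time to compute) matrix is the standard UPMS input: `etc[t][m]` is the runtime of task `t` on machine `m`, and no machine dominates across all tasks.

```python
from itertools import product

# Hypothetical ETC matrix: etc[t][m] = runtime of task t on machine m.
# "Unrelated" means no machine dominates: tasks 0-1 are fast on machine 0,
# tasks 2-3 are fast on machine 1.
etc = [
    [2, 9],  # task 0
    [2, 9],  # task 1
    [9, 2],  # task 2
    [9, 2],  # task 3
]
n_machines = 2

def makespan(assignment):
    """Makespan = completion time of the most loaded machine."""
    loads = [0] * n_machines
    for task, machine in enumerate(assignment):
        loads[machine] += etc[task][machine]
    return max(loads)

# Greedy "least-loaded" placement (a stand-in for a local heuristic):
# each task goes to the machine with the smallest current load,
# ignoring per-machine runtimes.
loads = [0] * n_machines
greedy = []
for task in range(len(etc)):
    machine = loads.index(min(loads))
    greedy.append(machine)
    loads[machine] += etc[task][machine]

# Global optimum by exhaustive search (feasible only at toy scale).
best = min(product(range(n_machines), repeat=len(etc)), key=makespan)

print(makespan(greedy), makespan(best))  # → 11 4
```

Here the load-balancing heuristic yields a makespan of 11, while the globally optimal assignment (tasks 0-1 on machine 0, tasks 2-3 on machine 1) finishes in 4, because the heuristic never looks at the per-machine runtimes it is committing to.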

Additional Details: Metaheuristic-based Global Task Placement

Hi Ray community,

I’m currently researching a hybrid metaheuristic approach (GA + VNS) to address the Unrelated Parallel Machine Scheduling (UPMS) problem in highly heterogeneous distributed systems.
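For readers unfamiliar with the hybrid, here is a minimal sketch of the GA+VNS idea on a synthetic ETC matrix. Everything here (matrix sizes, rates, neighborhood choices) is illustrative and not taken from our actual study: the chromosome is simply a machine index per task, the GA does elitist selection with uniform crossover, and VNS intensifies the incumbent by alternating between a "reassign one task" and a "swap two tasks" neighborhood.

```python
import random

random.seed(0)

# Synthetic ETC matrix: etc[t][m] = runtime of task t on machine m.
N_TASKS, N_MACHINES = 20, 4
etc = [[random.uniform(1, 10) for _ in range(N_MACHINES)]
       for _ in range(N_TASKS)]

def makespan(assign):
    loads = [0.0] * N_MACHINES
    for t, m in enumerate(assign):
        loads[m] += etc[t][m]
    return max(loads)

def vns(assign, k_max=2, iters=200):
    """Variable Neighborhood Search: shake in neighborhood k, accept
    improving moves; widen k on failure, reset to k=1 on success."""
    best = list(assign)
    k = 1
    for _ in range(iters):
        cand = list(best)
        if k == 1:  # N1: reassign one random task
            cand[random.randrange(N_TASKS)] = random.randrange(N_MACHINES)
        else:       # N2: swap the machines of two random tasks
            i, j = random.sample(range(N_TASKS), 2)
            cand[i], cand[j] = cand[j], cand[i]
        if makespan(cand) < makespan(best):
            best, k = cand, 1
        else:
            k = k + 1 if k < k_max else 1
    return best

def ga_vns(pop_size=30, generations=50):
    pop = [[random.randrange(N_MACHINES) for _ in range(N_TASKS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=makespan)
        elite = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(elite):
            a, b = random.sample(elite, 2)
            child = [random.choice(pair) for pair in zip(a, b)]  # uniform crossover
            if random.random() < 0.2:                            # point mutation
                child[random.randrange(N_TASKS)] = random.randrange(N_MACHINES)
            children.append(child)
        pop = elite + children
        pop[0] = vns(pop[0], iters=50)  # intensify the incumbent with VNS
    return min(pop, key=makespan)

best = ga_vns()
print(round(makespan(best), 2))
```

The real implementation adds energy terms to the fitness and tuned neighborhood schedules, but the skeleton above is the whole pattern: a population-level global search with a trajectory-level local search embedded in each generation.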

Research Findings:

In our simulations on standard benchmarks (Braun et al.), the approach significantly outperforms classic heuristics (such as HEFT and Min-Min) in both makespan and energy consumption (statistically validated with the Friedman test, p < 0.001).

Proposed Architecture for Ray Integration:

Since metaheuristics (Genetic Algorithms + Variable Neighborhood Search) can be computationally heavy, I am exploring an Asynchronous Optimization Loop pattern:

  1. Background Optimizer: A service that continuously observes the Ray cluster state and the pending task queue. It runs the GA+VNS logic out-of-band to compute a “Global Ideal Mapping.”

  2. Placement Advice: This mapping is then used to provide “advice,” or ranked weights, to the Ray scheduler, allowing millisecond-level placement decisions that align with a globally optimal plan.
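The hand-off between the two components could look roughly like the sketch below. All names here are hypothetical (Ray exposes no such hook today): the optimizer publishes a ranked mapping out-of-band, and the fast path only does a dictionary lookup plus a staleness check, so per-task decisions stay cheap even while the GA+VNS loop runs for seconds in the background.

```python
import time

class PlacementAdvisor:
    """Hypothetical bridge between a background optimizer and a scheduler."""

    def __init__(self, ttl_s=5.0):
        self.mapping = {}        # task_key -> ranked list of node ids
        self.published_at = 0.0
        self.ttl_s = ttl_s       # advice older than this is ignored

    def publish(self, mapping):
        """Called by the background GA+VNS loop with a fresh global plan."""
        self.mapping = mapping
        self.published_at = time.monotonic()

    def advise(self, task_key):
        """Fast-path lookup: ranked candidate nodes, or None to fall back
        to the default (local-heuristic) scheduler."""
        if time.monotonic() - self.published_at > self.ttl_s:
            return None  # stale plan: don't fight the default scheduler
        return self.mapping.get(task_key)

advisor = PlacementAdvisor()
advisor.publish({"resnet_infer": ["gpu-node-3", "gpu-node-1"],
                 "etl_shard": ["arm-node-7"]})
print(advisor.advise("resnet_infer"))  # → ['gpu-node-3', 'gpu-node-1']
print(advisor.advise("unknown_task"))  # → None
```

In today's Ray, the top-ranked node could presumably be approximated per task with soft node affinity (`NodeAffinitySchedulingStrategy(node_id, soft=True)` from `ray.util.scheduling_strategies`), though that loses the ranked fallback list; that limitation is part of what motivates question 2 below.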

Questions for the community:

  1. Is hardware heterogeneity (e.g., mixing H100/A100/T4 GPUs or ARM/x86) a major pain point you see when scaling Ray workloads?

  2. Has there been any exploration into search-based/metaheuristic placement policies within the Ray Core scheduler or as a pluggable optimizer?

  3. Since Ray is heavily Python-based, would there be interest in a modular “Optimization Service” that can ingest resource metrics to guide placement?

I would love to get the Ray community’s perspective on how such a pattern could fit into the Ray roadmap!

It seems likely that hardware heterogeneity is a significant challenge for Ray users, especially in large, mixed-architecture clusters. Ray’s default scheduler uses decentralized, local heuristics for efficiency and scalability, but these are not designed for globally optimal placement in highly heterogeneous environments. There is no evidence in the provided sources that Ray Core currently supports metaheuristic or search-based global placement policies, nor is there mention of a pluggable optimizer interface for such strategies. Most scheduling customization is limited to resource tags, label selectors, and placement groups, which are rule-based rather than optimization-driven [3a03c578, 8d78c82d].

There is community interest in more advanced scheduling, as seen in discussions about custom resource prediction models and the need for better heterogeneity support [0ea42cbb, 75f634d2]. However, no official roadmap items or APIs for integrating metaheuristic optimizers are documented. If you are considering building a modular optimization service, it would be a novel contribution, but you may need to work around Ray’s internal scheduling logic or propose changes upstream. Would you like more detail on Ray’s current scheduling extensibility or examples of how others have approached similar problems?
