1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
2. Environment:
- Ray version: N/A (Looking into potential integration)
- Python version: 3.10+
- OS: Linux / Darwin
- Cloud/Infrastructure: Highly Heterogeneous Clusters (e.g., Mixed ARM/x86, varying GPU generations)
3. What happened vs. what you expected:
- Expected: A scheduling mechanism that can achieve near-global optimal task-to-node mapping in clusters where machine speeds are “unrelated” in the UPMS sense (i.e., a task’s performance depends heavily on the specific architecture/instruction set, with no fixed speed ratio between machines).
- Actual: Most default schedulers rely on local heuristics (like least-requested or round-robin) which, while fast, often result in sub-optimal makespan and energy waste in complex heterogeneous scenarios.
Additional Details: Metaheuristic-based Global Task Placement
Hi Ray community,
I’m currently researching a hybrid metaheuristic approach (GA + VNS) to address the Unrelated Parallel Machine Scheduling (UPMS) problem in highly heterogeneous distributed systems.
Research Findings:
In our simulations using standard benchmarks (Braun et al.), our approach significantly outperforms classic heuristics (like HEFT or Min-Min) in both Makespan and Energy Consumption (statistically validated with Friedman test p < 0.001).
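For concreteness, here is a minimal, self-contained sketch (illustrative only, not our actual implementation) of the core UPMS objective: a processing-time matrix `p[i][j]` where times on different machines are unrelated, a makespan evaluation, and a single VNS-style neighborhood move. All values are made up for the example:

```python
# Hypothetical sketch: evaluating and improving a UPMS solution.
# p[i][j] = processing time of task i on machine j; times across
# machines have no fixed ratio, hence "unrelated" machines.
p = [
    [4, 9, 2],   # task 0
    [7, 3, 8],   # task 1
    [6, 6, 1],   # task 2
    [5, 2, 9],   # task 3
]

def makespan(assignment, p):
    """Makespan = load of the most loaded machine."""
    load = [0] * len(p[0])
    for task, machine in enumerate(assignment):
        load[machine] += p[task][machine]
    return max(load)

def vns_move(assignment, p):
    """One VNS-style improvement step: reassign a single task to
    whichever machine lowers the makespan the most."""
    best = list(assignment)
    best_cost = makespan(assignment, p)
    for task in range(len(assignment)):
        for machine in range(len(p[0])):
            cand = list(assignment)
            cand[task] = machine
            cost = makespan(cand, p)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost

greedy = [0, 0, 2, 1]              # e.g. a cheap per-task mapping
improved, cost = vns_move(greedy, p)
# makespan drops from 11 to 5 by moving task 1 off machine 0
```

In the full approach, a GA explores the global assignment space while moves like `vns_move` intensify around promising solutions; energy consumption is handled analogously with a per-machine energy matrix.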
Proposed Architecture for Ray Integration:
Since metaheuristics (Genetic Algorithms + Variable Neighborhood Search) can be computationally heavy, I am exploring an Asynchronous Optimization Loop pattern:
- Background Optimizer: a service that continuously observes the Ray cluster state and the pending task queue, running the GA+VNS logic out-of-band to compute a “Global Ideal Mapping.”
- Placement Advice: the latest mapping is exposed to the Ray scheduler as “advice” (ranked node weights), so that millisecond-level placement decisions can stay aligned with the globally optimal plan.
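The key property of this pattern is that the heavy optimization never sits on the scheduling hot path. Below is a hedged, pure-Python sketch of the advice-table mechanism; names like `PlacementAdvisor` and the node IDs are placeholders, not Ray APIs. In a real integration, the ranked nodes could be translated into soft `NodeAffinitySchedulingStrategy` hints:

```python
import threading
import time

class PlacementAdvisor:
    """Hypothetical advice table: the out-of-band GA+VNS optimizer
    publishes a task-class -> ranked-node mapping; the placement
    hot path only does a dictionary lookup."""

    def __init__(self):
        self._advice = {}            # task_class -> ranked list of node ids
        self._lock = threading.Lock()

    def publish(self, advice):
        """Called by the background optimizer whenever a new global
        mapping is ready (e.g. every few seconds)."""
        with self._lock:
            self._advice = dict(advice)

    def rank_nodes(self, task_class, default_nodes):
        """Millisecond-level lookup at placement time; falls back to
        the scheduler's default ordering if no advice exists yet."""
        with self._lock:
            return self._advice.get(task_class, default_nodes)

advisor = PlacementAdvisor()

def optimizer_loop(advisor, stop):
    # Stand-in for the GA+VNS service observing cluster state.
    while not stop.is_set():
        advisor.publish({"gpu-train": ["node-h100", "node-a100", "node-t4"]})
        stop.wait(0.05)

stop = threading.Event()
t = threading.Thread(target=optimizer_loop, args=(advisor, stop))
t.start()
time.sleep(0.1)                       # let at least one publish land
ranked = advisor.rank_nodes("gpu-train", default_nodes=["node-a100"])
stop.set()
t.join()
```

The fallback path matters: until the optimizer has published its first mapping (or for task classes it has never seen), placement degrades gracefully to the default heuristic rather than blocking.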
Questions for the community:
- Is hardware heterogeneity (e.g., mixing H100/A100/T4 GPUs, or ARM/x86) a major pain point you see when scaling Ray workloads?
- Has there been any exploration of search-based/metaheuristic placement policies within the Ray Core scheduler, or as a pluggable optimizer?
- Since Ray is heavily Python-based, would there be interest in a modular “Optimization Service” that ingests resource metrics to guide placement?
I would love to get the Ray community’s perspective on how such a pattern could fit into the Ray roadmap!