Enforcing CPU-only defaults while reserving specific tasks for GPU nodes in a Ray cluster

Hi all,

I’m building a small Ray cluster with:

  • 1 CPU node (uniform environment, default target for most work)
  • 1 GPU node (different environment; more heterogeneous GPU nodes coming later)

Goal:

  • Every Ray task/actor should default to CPU nodes.
  • Only a handful of tasks should ever reach GPU nodes (each GPU node has different software).

What I’ve tried / pain points:

  • Custom resources or node labels act like a whitelist, but I’d have to annotate *every* CPU-bound @ray.remote call just to keep them off the GPU node (see the sketch after this list). It would also be great if CPU tasks that work on the same data could run on the same node, so the data doesn’t have to move around.
  • Setting the GPU node’s num_cpus=0 avoids CPU work landing there, but then I must define another custom resource just so CPU-aware code keeps working—feels brittle.
  • I’d prefer “opt-in access” to the GPU node (e.g., “node only schedulable for tasks needing ≥1 custom ticket”), rather than “node provides up to N tickets.”
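
For concreteness, here is roughly what those two workarounds look like. This is only a sketch: the resource name "cpu_slot" is a placeholder I made up, and the ray start commands assume a manually started cluster (with the autoscaler or KubeRay the same values go into the cluster config instead):

    # (a) Whitelist via a custom resource: advertise it on the CPU node at start
    #     time, then request a sliver of it on every CPU-bound task:
    #       ray start --head --resources='{"cpu_slot": 1000}'
    #
    # (b) Alternatively, keep CPU work off the GPU node by advertising zero CPUs:
    #       ray start --address=<head-ip>:6379 --num-cpus=0 --num-gpus=1

    import ray

    @ray.remote(resources={"cpu_slot": 0.001})  # (a): must be repeated on every CPU task
    def preprocess(x):
        return x * 2

    @ray.remote(num_cpus=0, num_gpus=1)  # (b): num_cpus=0 may be needed here, since
    def train_on_gpu(x):                 # tasks request 1 CPU by default
        return x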

Questions:

  1. Is there a built-in way to set a cluster-wide default resource/tag so only explicitly opted-in tasks run on GPU nodes?
  2. How do others manage heterogeneous nodes (different hardware/software) without modifying hundreds of existing @ray.remote decorators?
  3. Any best practices for ensuring GPU nodes only host tasks that “require at least one GPU,” even if they also have CPUs available?

Thanks for any guidance!

In short:

  • Is there some way to set a default task label filter or resource requirement?
  • Or a default environment?
  • Any way to avoid having to set a label filter manually on ALL @ray.remote calls?

There is currently no built-in Ray feature to globally restrict non-GPU tasks to CPU nodes, or to make GPU nodes “opt-in” targets for only GPU-requiring tasks, without annotating every @ray.remote call or using custom resources/labels. The Ray scheduler treats any node with enough available resources as a valid target, so CPU-only tasks may land on GPU nodes whenever those nodes have free CPUs, unless you explicitly prevent it via resource requirements or by setting num_cpus=0 on GPU nodes (which, as you noted, is brittle and can break code expecting CPUs) [Ray GitHub Issue #47866] [Ray Discuss: How to prevent scheduling non-GPU tasks to GPU nodes].
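
To make the default behavior concrete, you can inspect what each node advertises to the scheduler with the public ray.nodes() API; a GPU node that still reports "CPU" > 0 is a legal target for plain CPU-only tasks. A quick inspection sketch:

    import ray

    ray.init(address="auto")  # connect to the running cluster

    # Each alive node reports the logical resources the scheduler can hand out.
    for node in ray.nodes():
        if node["Alive"]:
            print(node["NodeManagerAddress"], node["Resources"])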

Best practices today:

  • Use custom resources or node labels to tag CPU nodes and require those resources for CPU tasks; this does require updating all relevant @ray.remote calls.
  • Set num_cpus=0 on GPU nodes, but then any task that should still land there needs a custom resource (or GPU) request, which is not ideal.
  • Use placement groups or node affinity for fine-grained control, which also requires code changes.

There is ongoing discussion and feature requests for a more Kubernetes-like taints/tolerations or “opt-in” scheduling model, but it is not yet implemented [Ray GitHub Issue #47866] [Ray Discuss: Heterogeneous clusters].
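
Two hedged sketches of these approaches: attaching the custom-resource requirement at call time with .options(), so existing @ray.remote decorators stay untouched, and co-locating tasks that share data with NodeAffinitySchedulingStrategy. The resource name "cpu_slot" is a placeholder that assumes only CPU nodes advertise it; treat this as a sketch, not a drop-in recipe:

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init(address="auto")

    @ray.remote
    def existing_task(x):  # existing decorator stays untouched
        return x + 1

    # Approach (1) without editing decorators: attach the custom-resource
    # requirement at call time via .options().
    ref = existing_task.options(resources={"cpu_slot": 0.001}).remote(41)

    # Approach (3): pin a follow-up task to the node where earlier work ran,
    # so tasks sharing data can stay together (soft=True falls back to other
    # nodes if the target node is unavailable).
    @ray.remote(resources={"cpu_slot": 0.001})
    def produce():
        return ray.get_runtime_context().get_node_id(), list(range(10))

    node_id, data = ray.get(produce.remote())

    @ray.remote
    def consume(xs):
        return sum(xs)

    colocated = consume.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=True)
    ).remote(data)

    print(ray.get([ref, colocated]))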
