Hi all,
I’m building a small Ray cluster with:
- 1 CPU node (uniform environment, default target for most work)
- 1 GPU node (different environment; more heterogeneous GPU nodes coming later)
Goal:
- Every Ray task/actor should default to CPU nodes.
- Only a handful of tasks should ever reach GPU nodes (each GPU node has different software).
What I’ve tried / pain points:
- Custom resources or node labels act like a whitelist, but I'd have to annotate *every* CPU-bound `@ray.remote` call to keep them off the GPU node. It would also be great if some CPU remotes that work on the same data could run on the same node, so the data doesn't have to move around (see the colocation sketch after this list).
- Setting the GPU node's `num_cpus=0` avoids CPU work landing there, but then I must define another custom resource just so CPU-aware code keeps working, which feels brittle (first sketch below).
- I'd prefer "opt-in access" to the GPU node (e.g., "this node is only schedulable for tasks that need ≥1 custom ticket"), rather than "this node provides up to N tickets."
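To make the `num_cpus=0` point concrete, this is roughly the workaround I have today (CPU counts and the `gpu_node_cpu` resource name are just placeholders):

```bash
# CPU (head) node: normal defaults, so untagged tasks land here
ray start --head --num-cpus=32

# GPU node: hide its real CPUs and advertise a stand-in resource instead
# ("gpu_node_cpu" is just a placeholder name for this post)
ray start --address=<head-ip>:6379 --num-cpus=0 --num-gpus=1 \
  --resources='{"gpu_node_cpu": 16}'
```

```python
import ray

ray.init(address="auto")

@ray.remote  # untouched task: needs 1 CPU, and the GPU node advertises none
def cpu_task(x):
    return x * 2

# The GPU task has to opt out of the normal CPU resource and request the
# stand-in instead, which is exactly the brittle part.
@ray.remote(num_cpus=0, num_gpus=1, resources={"gpu_node_cpu": 1})
def gpu_task(x):
    return x

print(ray.get([cpu_task.remote(1), gpu_task.remote(2)]))
```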
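And this is what I meant about colocating CPU remotes that share data: as far as I understand, `NodeAffinitySchedulingStrategy` can pin follow-up work to whichever node produced the data, but again only by touching each call site (rough sketch, paths and function names made up):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

@ray.remote
def load_shard(path):
    # Return the node this ran on plus some (fake) loaded data
    return ray.get_runtime_context().get_node_id(), f"contents of {path}"

@ray.remote
def process(data):
    return data.upper()

node_id, data = ray.get(load_shard.remote("/data/shard-0"))

# Prefer the node that already holds the data (soft=True falls back elsewhere)
result = ray.get(
    process.options(
        scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=True)
    ).remote(data)
)
```

(I realize Ray's locality-aware scheduling often keeps work near large objects on its own; the point is more that I'd rather not annotate call sites at all.)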
Questions:
- Is there a built-in way to set a cluster-wide default resource/tag so only explicitly opted-in tasks run on GPU nodes?
- How do others manage heterogeneous nodes (different hardware/software) without modifying hundreds of existing `@ray.remote` decorators? (See the snippet after these questions for the kind of churn I mean.)
- Any best practices for ensuring GPU nodes only host tasks that "require at least one GPU," even if they also have CPUs available?
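The only approach I know of today is to tag every existing CPU task with a custom resource that only CPU nodes advertise (the `cpu_only` name is a placeholder), e.g.:

```python
# Every one of the hundreds of existing CPU tasks would need this extra tag
# just to stay off the GPU node ("cpu_only" would be advertised only by CPU nodes).
@ray.remote(resources={"cpu_only": 1})
def some_existing_task(x):
    return x + 1
```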
Thanks for any guidance!
In short:
- Is there some way to set a default task label filter or resource requirement?
- Or a default environment?
- Any way to avoid setting a label filter manually on ALL `@ray.remote` calls?
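Ideally, the only per-task change anywhere in the codebase would be on the handful of GPU tasks themselves, so that something like this is enough and GPU nodes stay opt-in (this is the behavior I'm after, not something I've found a built-in setting for):

```python
import ray

@ray.remote  # should never be scheduled on a GPU node
def preprocess(batch):
    return [x / 255.0 for x in batch]

@ray.remote(num_gpus=1)  # the only kind of task that should ever reach a GPU node
def infer(batch):
    return batch  # stand-in for real GPU work
```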