Hey there, I just had a quick question: does Ray have anything for automatically/dynamically distributing compute and resources during training? For example, detecting a device's or cluster's total resources and scaling up the parallelism/workers until it fills the capacity (with some buffer, of course).
Yes, Ray provides built-in autoscaling and resource-aware scheduling. Ray detects each node's resources (CPUs, GPUs, memory) when the node starts, and its scheduler places tasks, actors, and training workers according to the resources they request. The autoscaler adds worker nodes when pending resource requests cannot be satisfied by the current cluster and removes nodes that stay idle past a timeout, within the min/max limits you configure for cost efficiency. Note that the autoscaler reacts to logical resource requests rather than measured hardware utilization, and Ray Train typically runs with the worker count you specify rather than expanding on its own to fill spare capacity. You can also use ray.autoscaler.sdk.request_resources to request capacity manually, but for most ML and distributed workloads, including Ray Train and Ray Data jobs, the autoscaler handles node scaling automatically. For more, see the Ray Autoscaler, request_resources API, and cost-performance docs.
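As a rough illustration of the "fill the capacity with some buffer" idea from the question, here is a minimal sketch that sizes a Ray Train job from the cluster's reported totals. The helper names workers_that_fit and build_scaling_config, the 10% buffer, and the per-worker shape of 4 CPUs and 1 GPU are all assumptions made up for this example, not part of Ray. The import path ray.train.ScalingConfig assumes a recent Ray 2.x release.

```python
import math


def workers_that_fit(total, per_worker, buffer_fraction=0.1):
    """Return how many workers fit in `total` after reserving a buffer.

    total:       resource totals, e.g. the dict returned by ray.cluster_resources()
    per_worker:  resources one worker needs, e.g. {"CPU": 4, "GPU": 1}
    (Hypothetical helper for illustration; not a Ray API.)
    """
    if not 0 <= buffer_fraction < 1:
        raise ValueError("buffer_fraction must be in [0, 1)")
    counts = []
    for name, amount in per_worker.items():
        if amount <= 0:
            continue  # a zero requirement never limits the worker count
        usable = total.get(name, 0.0) * (1.0 - buffer_fraction)
        counts.append(math.floor(usable / amount))
    # The scarcest resource decides how many workers fit.
    return min(counts) if counts else 0


def build_scaling_config(per_worker, buffer_fraction=0.1):
    """Size a Ray Train ScalingConfig from the cluster's current totals.

    (Hypothetical helper; Ray is imported lazily so that workers_that_fit
    can be used and tested without Ray installed.)
    """
    import ray
    from ray.train import ScalingConfig  # assumes a recent Ray 2.x release

    ray.init(ignore_reinit_error=True)
    num_workers = workers_that_fit(
        ray.cluster_resources(), per_worker, buffer_fraction
    )
    if num_workers < 1:
        raise RuntimeError("cluster is too small for even one worker")
    return ScalingConfig(
        num_workers=num_workers,
        use_gpu=per_worker.get("GPU", 0) > 0,
        resources_per_worker=per_worker,
    )
```

Calling build_scaling_config({"CPU": 4, "GPU": 1}) on a running cluster would give a config you can pass to a Ray Train trainer. Keep in mind that ray.cluster_resources() only reports nodes that have already joined, so on an autoscaling cluster this snapshot can undercount what the autoscaler could still add.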
Would you like more detail or code examples on how to configure or use Ray’s autoscaling and dynamic scheduling?