Hi everyone,
I’m currently working on a project that involves scaling machine learning workloads using Ray, and I’ve run into a few architectural questions I was hoping to get some community insights on.
Our current setup is a hybrid cluster: a mix of on-premise GPU nodes and cloud-based CPU instances (AWS EC2). We're using Ray mainly for distributed hyperparameter tuning (with Tune) and some parallel data preprocessing tasks. So far the flexibility has been great, but managing performance and reliability across the on-prem and cloud nodes is proving tricky.
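For context, our Tune usage looks roughly like the sketch below. The objective function, search space, and resource numbers are placeholders rather than our real workload, and it assumes a Ray 2.x-style `Tuner` API with a cluster you attach to via `ray.init(address="auto")`:

```python
import ray
from ray import tune

def objective(config):
    # Placeholder trainable: stands in for our real training loop.
    return {"score": config["lr"] * 100}

ray.init(address="auto")  # attach to the existing hybrid cluster

tuner = tune.Tuner(
    # Request 1 GPU per trial so trials land on the on-prem GPU nodes;
    # the exact numbers here are illustrative.
    tune.with_resources(objective, {"cpu": 2, "gpu": 1}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```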
Here are a few specific things I’d love feedback on:
- Is it advisable to place the head node on the cloud side or on-prem, and what tradeoffs have you experienced either way?
- Are there any known best practices for ensuring tasks and actors are scheduled efficiently across such a heterogeneous environment? (The first sketch after this list shows the kind of custom-resource pattern I'm considering.)
- We occasionally see workers timing out or failing silently when workloads span both the on-prem and cloud nodes. Has anyone dealt with this kind of instability, and what monitoring/debugging tools do you recommend? (The second sketch below shows the retry/timeout handling I've been trying.)
- For autoscaling on the cloud side, do folks generally rely on Ray's built-in cluster launcher, or do you integrate with something like K8s for better control? (The last snippet below is how I understand the built-in autoscaler can be nudged programmatically.)
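To make the scheduling question concrete, here's the kind of pattern I'm considering: tagging nodes with custom resources and pinning work to them. It assumes the nodes were started with made-up resource tags, e.g. `ray start --resources='{"on_prem_gpu": 1}'` on the GPU boxes and `--resources='{"cloud_cpu": 1}'` on the EC2 workers; the names and numbers are purely illustrative:

```python
import ray

ray.init(address="auto")

@ray.remote(num_gpus=1, resources={"on_prem_gpu": 0.01})
def train_batch(batch):
    # Only schedulable on nodes advertising the "on_prem_gpu" tag.
    return len(batch)

@ray.remote(num_cpus=1, resources={"cloud_cpu": 0.01})
def preprocess(shard):
    # Only schedulable on the cloud CPU instances.
    return [x * 2 for x in shard]

processed = ray.get([preprocess.remote([1, 2, 3]) for _ in range(4)])
```

I'm not sure whether custom resources like this are the intended pattern here, or whether placement groups are a better fit for keeping related work together, hence the question.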
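On the flaky-worker point, this is the kind of defensive retry/timeout handling I've been sketching so far (the values are arbitrary; the point is just to surface hangs instead of letting tasks fail silently). Does this look reasonable, or are there better tools for spotting these failures?

```python
import ray
from ray.exceptions import GetTimeoutError, RayTaskError

ray.init(address="auto")

@ray.remote(max_retries=3, retry_exceptions=True)
def flaky_preprocess(shard_id):
    # Stand-in for a preprocessing step that sometimes dies on an unstable worker.
    return shard_id

ref = flaky_preprocess.remote(0)
try:
    result = ray.get(ref, timeout=60)  # surface hangs instead of waiting forever
except GetTimeoutError:
    print("task did not finish within 60s")
except RayTaskError as e:
    print(f"task failed after retries: {e}")
```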
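And on autoscaling: my understanding is that, with the built-in cluster launcher's autoscaler running, you can pre-warm the cloud side programmatically like this (numbers illustrative). I'm curious whether people actually use this in practice or go straight to K8s for that level of control:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to keep at least 64 CPUs' worth of cloud workers up
# ahead of a big preprocessing wave.
request_resources(num_cpus=64)

# Relax the floor again once the burst is done.
request_resources(num_cpus=0)
```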
If there’s any documentation, example repos, or other resources that are particularly relevant, I’d appreciate being pointed in the right direction as well.
Thanks in advance! Looking forward to learning from your experiences.