Hi everyone,
I’m currently working on a project that involves scaling machine learning workloads using Ray, and I’ve run into a few architectural questions I was hoping to get some community insights on.
Our current setup is a hybrid cluster: a mix of on-premise GPU nodes and cloud-based CPU instances (AWS EC2). We're using Ray mainly for distributed hyperparameter tuning (with Tune) and some parallel data preprocessing tasks. So far the flexibility has been great, but managing performance and reliability across the on-prem and cloud nodes is proving tricky.
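For context, our Tune usage looks roughly like the sketch below. The objective function, search space, and resource numbers are placeholders rather than our real workload, and it assumes a Ray 2.x-style `Tuner` API with a cluster you attach to via `ray.init(address="auto")`:

```python
import ray
from ray import tune

def objective(config):
    # Placeholder trainable: stands in for our real training loop.
    return {"score": config["lr"] * 100}

ray.init(address="auto")  # attach to the existing hybrid cluster

tuner = tune.Tuner(
    # Request 1 GPU per trial so trials land on the on-prem GPU nodes;
    # the exact numbers here are illustrative.
    tune.with_resources(objective, {"cpu": 2, "gpu": 1}),
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(num_samples=20),
)
results = tuner.fit()
```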
Here are a few specific things I’d love feedback on:
- Is it advisable to place the head node on the cloud side or on-prem, and what tradeoffs have you experienced either way?
- Are there any known best practices for ensuring tasks and actors are scheduled efficiently across such a heterogeneous environment? (The first sketch after this list shows the kind of custom-resource pattern I'm considering.)
- We occasionally see workers timing out or failing silently when workloads span both the on-prem and cloud nodes. Has anyone dealt with this kind of instability, and what monitoring/debugging tools do you recommend? (The second sketch below shows the retry/timeout handling I've been trying.)
- For autoscaling on the cloud side, do folks generally rely on Ray's built-in cluster launcher, or do you integrate with something like K8s for better control? (The last snippet below is how I understand the built-in autoscaler can be nudged programmatically.)
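To make the scheduling question concrete, here's the kind of pattern I'm considering: tagging nodes with custom resources and pinning work to them. It assumes the nodes were started with made-up resource tags, e.g. `ray start --resources='{"on_prem_gpu": 1}'` on the GPU boxes and `--resources='{"cloud_cpu": 1}'` on the EC2 workers; the names and numbers are purely illustrative:

```python
import ray

ray.init(address="auto")

@ray.remote(num_gpus=1, resources={"on_prem_gpu": 0.01})
def train_batch(batch):
    # Only schedulable on nodes advertising the "on_prem_gpu" tag.
    return len(batch)

@ray.remote(num_cpus=1, resources={"cloud_cpu": 0.01})
def preprocess(shard):
    # Only schedulable on the cloud CPU instances.
    return [x * 2 for x in shard]

processed = ray.get([preprocess.remote([1, 2, 3]) for _ in range(4)])
```

I'm not sure whether custom resources like this are the intended pattern here, or whether placement groups are a better fit for keeping related work together, hence the question.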
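On the flaky-worker point, this is the kind of defensive retry/timeout handling I've been sketching so far (the values are arbitrary; the point is just to surface hangs instead of letting tasks fail silently). Does this look reasonable, or are there better tools for spotting these failures?

```python
import ray
from ray.exceptions import GetTimeoutError, RayTaskError

ray.init(address="auto")

@ray.remote(max_retries=3, retry_exceptions=True)
def flaky_preprocess(shard_id):
    # Stand-in for a preprocessing step that sometimes dies on an unstable worker.
    return shard_id

ref = flaky_preprocess.remote(0)
try:
    result = ray.get(ref, timeout=60)  # surface hangs instead of waiting forever
except GetTimeoutError:
    print("task did not finish within 60s")
except RayTaskError as e:
    print(f"task failed after retries: {e}")
```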
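And on autoscaling: my understanding is that, with the built-in cluster launcher's autoscaler running, you can pre-warm the cloud side programmatically like this (numbers illustrative). I'm curious whether people actually use this in practice or go straight to K8s for that level of control:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler to keep at least 64 CPUs' worth of cloud workers up
# ahead of a big preprocessing wave.
request_resources(num_cpus=64)

# Relax the floor again once the burst is done.
request_resources(num_cpus=0)
```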
If there’s any documentation, example repos, or other resources that are particularly relevant, I’d appreciate being pointed in the right direction as well.
Thanks in advance! Looking forward to learning from your experiences.