Hi everyone,
I’m setting up a distributed ML pipeline on Ray and could use some guidance on structuring it efficiently with Ray Workflows.
The use case involves multiple stages: data preprocessing, model training (potentially across several model variants), evaluation, and then deploying the best model. Each stage can be broken down into smaller, reusable components that should ideally run asynchronously and in parallel wherever possible. I’ve used Ray’s actors and remote functions for distributed tasks before, but I’m now exploring Ray Workflows to get better orchestration, checkpointing, and DAG visibility.
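For context, here is roughly the shape of the DAG I have in mind. This is only a minimal sketch, assuming the Ray 2.x Workflows API where tasks are plain `@ray.remote` functions composed with `.bind()` and executed via `workflow.run()`; the task bodies, the variant names, the S3 path, and the storage path are placeholders rather than my real pipeline:

```python
import ray
from ray import workflow


@ray.remote
def preprocess(raw_path: str) -> dict:
    # Placeholder: load and clean the raw data, return the processed dataset.
    return {"dataset": f"processed::{raw_path}"}


@ray.remote
def train(data: dict, variant: str) -> dict:
    # Placeholder: train one model variant on the preprocessed data.
    return {"variant": variant, "model": f"model::{variant}"}


@ray.remote
def evaluate(trained: dict) -> dict:
    # Placeholder: score the trained model (dummy metric here).
    return {**trained, "score": float(len(trained["variant"]))}


@ray.remote
def pick_best(*results: dict) -> dict:
    # Fan-in step: choose the best-scoring variant for deployment.
    return max(results, key=lambda r: r["score"])


if __name__ == "__main__":
    # Local placeholder storage path; workflow checkpoints are persisted here.
    ray.init(storage="/tmp/ray-workflow-storage")
    data = preprocess.bind("s3://my-bucket/raw.parquet")  # hypothetical input path
    evals = [evaluate.bind(train.bind(data, v)) for v in ("xgb", "mlp", "linear")]
    dag = pick_best.bind(*evals)
    best = workflow.run(dag, workflow_id="ml-pipeline-demo")
    print("best variant:", best)
```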
Here are a few questions I have:
- What are best practices when building large-scale DAGs with nested dependencies in Ray Workflows?
- How should I handle intermediate data between steps (e.g., large preprocessed datasets)? Is it advisable to persist it in external storage, or to rely on Ray’s object store? (I’ve sketched the two options I’m weighing right after this list.)
- For versioning and tracking outputs (like models or metrics), do you integrate with external tools or stick with Ray’s metadata APIs?
- Also, are there examples of modular or templated workflows that support reusability across different ML experiments? (The second sketch below shows the pattern I’m imagining.)
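To make the intermediate-data question concrete, these are the two patterns I’m weighing: (a) return the dataset from the step, so it flows through Ray’s object store and (as I understand it) gets checkpointed by the workflow layer, versus (b) write it to external storage and only pass a URI between steps. The paths are hypothetical and pandas/parquet is used purely for illustration:

```python
import pandas as pd
import ray


@ray.remote
def preprocess_via_object_store(raw_path: str) -> pd.DataFrame:
    # (a) Return the DataFrame itself; it moves between steps via the object store.
    df = pd.read_parquet(raw_path)
    return df.dropna()


@ray.remote
def preprocess_to_external_storage(raw_path: str, out_uri: str) -> str:
    # (b) Persist the processed data externally and hand downstream steps only a URI.
    df = pd.read_parquet(raw_path).dropna()
    df.to_parquet(out_uri)
    return out_uri
```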
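And for the reusability question, this is the kind of template I’m imagining: a plain Python builder function that assembles the same DAG shape from experiment parameters. It reuses the `preprocess`/`train`/`evaluate`/`pick_best` tasks from the first sketch, and the configs and workflow IDs are made up:

```python
from ray import workflow

# Assumes the preprocess / train / evaluate / pick_best tasks from the first sketch.


def build_experiment_dag(raw_path: str, variants: list[str]):
    # One reusable "template": same DAG shape, parameterized per experiment.
    data = preprocess.bind(raw_path)
    evals = [evaluate.bind(train.bind(data, v)) for v in variants]
    return pick_best.bind(*evals)


# Same template, different experiment configurations and workflow IDs.
for i, variants in enumerate([["xgb", "mlp"], ["xgb", "mlp", "linear"]]):
    dag = build_experiment_dag("s3://my-bucket/raw.parquet", variants)
    workflow.run(dag, workflow_id=f"experiment-{i}")
```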
If anyone has experience running production-grade ML pipelines with Ray Workflows, I’d love to hear about the challenges you faced and how you structured your workflows.
Thanks in advance!