Multiple trials, Tune, and the Autoscaler

We’ve been using Tune heavily for almost a year now with the Trainable class flow, and I’ve noticed that whenever a job with multiple trials is run, the first trial has to start and run for more than an epoch before the other trials even begin. In cases where the startup takes a long time, this is not ideal.
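
For reference, here’s a minimal sketch of the kind of Trainable we run; the class name, config keys, and the artificial sleep standing in for our real setup are all placeholders, not our actual code:

```python
import time

from ray import tune


class SlowSetupTrainable(tune.Trainable):
    """Placeholder Trainable with an expensive one-time setup."""

    def setup(self, config):
        # Stand-in for the real setup, which can take a long time
        # (data loading, model construction, etc.).
        time.sleep(config.get("setup_seconds", 5))
        self.lr = config.get("lr", 0.1)
        self.timestep = 0

    def step(self):
        # One "epoch" of training; Tune calls this repeatedly.
        self.timestep += 1
        return {"loss": self.lr / self.timestep}
```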

Would really appreciate any thoughts on this! Looking for ways to fix it or get around it.

Hey @Vishnu, could you provide more details about what you’re doing? Are you using the autoscaler, queue_trials, etc.?

Another step you could try is the latest nightly version of Ray.

We recently merged “[tune] enable placement groups per default” (ray-project/ray Pull Request #13906 by krfricke on GitHub), which should allow many trials to start at the same time and kick off autoscaling.
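
With that change, each trial’s resource request becomes a placement group bundle that is requested up front. Here’s a minimal sketch of what that can look like, assuming a nightly that includes the PR; the import path can differ between Ray versions, and the toy training function and bundle size are illustrative only:

```python
from ray import tune
# On recent nightlies this lives under ray.tune.utils.placement_groups;
# the exact module path may differ between Ray versions.
from ray.tune.utils.placement_groups import PlacementGroupFactory


def train_fn(config):
    # Toy training function; reports one result per "epoch".
    for step in range(10):
        tune.report(loss=1.0 / (step + 1))


tune.run(
    train_fn,
    num_samples=40,
    # Each trial requests its own bundle; with placement groups enabled,
    # pending trials ask for resources up front, which can trigger
    # autoscaling instead of waiting on the first running trial.
    resources_per_trial=PlacementGroupFactory([{"CPU": 1}]),
)
```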

Sure, @rliaw. We’re using the tune.run function, passing a list of Experiment objects as the first argument, and queue_trials is set to True. And yes, this was observed when using the autoscaler on AWS without Docker.
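
To make that concrete, here’s a rough sketch of the call pattern; the experiment names, placeholder Trainable, sample counts, and stopping criteria are stand-ins rather than our actual config:

```python
from ray import tune
from ray.tune import Experiment


class MyTrainable(tune.Trainable):
    # Placeholder standing in for our real Trainable.
    def setup(self, config):
        self.count = 0

    def step(self):
        self.count += 1
        return {"loss": 1.0 / self.count}


experiments = [
    Experiment(
        name=f"exp_{i}",
        run=MyTrainable,
        num_samples=4,
        stop={"training_iteration": 10},
    )
    for i in range(10)
]

tune.run(experiments, queue_trials=True)
```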

To give a specific scenario in case it helps, here’s what happened yesterday with Ray version 1.0.1. A training job with 40 trials was launched. The setup of the first trial ran for an hour, which was expected, but the other 39 trials weren’t started at all during that time. As soon as the first trial’s setup completed and an epoch had gone by, the remaining trials all started executing their setup steps in parallel and ran without issue.

Okay, it sounds like using Placement Groups might solve this problem. Let me try the nightly release and update you.
