Multiple trials, Tune, and the Autoscaler

We’ve been using Tune heavily for almost a year now with the Trainable class flow, and I’ve noticed that whenever a job with multiple trials is run, the first trial has to start and run for more than an epoch before the other trials even begin. In cases where the startup takes a long time, this is not ideal.
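
For reference, here’s a minimal sketch of the kind of Trainable we run; the class name, config keys, and the artificial sleep standing in for our real setup are all placeholders, not our actual code:

```python
import time

from ray import tune


class SlowSetupTrainable(tune.Trainable):
    """Placeholder Trainable with an expensive one-time setup."""

    def setup(self, config):
        # Stand-in for the real setup, which can take a long time
        # (data loading, model construction, etc.).
        time.sleep(config.get("setup_seconds", 5))
        self.lr = config.get("lr", 0.1)
        self.timestep = 0

    def step(self):
        # One "epoch" of training; Tune calls this repeatedly.
        self.timestep += 1
        return {"loss": self.lr / self.timestep}
```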

Would really appreciate any thoughts on this! Looking for ways to fix it or get around it.

Hey @Vishnu, could you provide more details about what you’re doing? Are you using the autoscaler, queue_trials, etc.?

Another step you could try is the latest nightly version of Ray.

We recently merged “[tune] enable placement groups per default” (ray-project/ray Pull Request #13906 by krfricke on GitHub), which should allow many trials to start at the same time and kick off autoscaling.
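
With that change, each trial’s resource request becomes a placement group bundle that is requested up front. Here’s a minimal sketch of what that can look like, assuming a nightly that includes the PR; the import path can differ between Ray versions, and the toy training function and bundle size are illustrative only:

```python
from ray import tune
# On recent nightlies this lives under ray.tune.utils.placement_groups;
# the exact module path may differ between Ray versions.
from ray.tune.utils.placement_groups import PlacementGroupFactory


def train_fn(config):
    # Toy training function; reports one result per "epoch".
    for step in range(10):
        tune.report(loss=1.0 / (step + 1))


tune.run(
    train_fn,
    num_samples=40,
    # Each trial requests its own bundle; with placement groups enabled,
    # pending trials ask for resources up front, which can trigger
    # autoscaling instead of waiting on the first running trial.
    resources_per_trial=PlacementGroupFactory([{"CPU": 1}]),
)
```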

Sure, @rliaw. We’re using the tune.run function, passing a list of Experiment objects as the first argument, and queue_trials is set to True. And yes, this was observed when using the autoscaler on AWS without Docker.
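
To make that concrete, here’s a rough sketch of the call pattern; the experiment names, placeholder Trainable, sample counts, and stopping criteria are stand-ins rather than our actual config:

```python
from ray import tune
from ray.tune import Experiment


class MyTrainable(tune.Trainable):
    # Placeholder standing in for our real Trainable.
    def setup(self, config):
        self.count = 0

    def step(self):
        self.count += 1
        return {"loss": 1.0 / self.count}


experiments = [
    Experiment(
        name=f"exp_{i}",
        run=MyTrainable,
        num_samples=4,
        stop={"training_iteration": 10},
    )
    for i in range(10)
]

tune.run(experiments, queue_trials=True)
```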

To give a specific scenario in case it helps, here’s what happened yesterday with Ray version 1.0.1. A training job with 40 trials was launched. The setup of the first trial ran for an hour, which was expected, but the other 39 trials weren’t started at all during that time. As soon as the first trial’s setup completed and an epoch had gone by, the remaining trials all started executing their setup steps in parallel and ran without issue.

Okay, it sounds like using Placement Groups might solve this problem. Let me try the nightly release and update you.
