How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Does Ray Tune restore ignore max_concurrent_trials when restarting errored trials?
While running a test experiment with Ray Tune, I noticed that when an experiment is restored with restart_errored=True, it reschedules and runs all of the previously errored trials, ignoring the max_concurrent_trials value.
Is there a way to ensure that restarting errored trials respects max_concurrent_trials on restore?
Example:
- Run the experiment
import time

# ray==2.4.0
from ray import air, tune


def train_test(config):
    time.sleep(10)
    if config["value3"]:
        value = config["value1"] + config["value2"]
    else:
        value = -1
        raise ValueError('Mock error...')
    return {"score": value}
# Define trial parameters as a single grid sweep.
trial_space = {
    "value1": tune.grid_search(range(0, 2)),
    "value2": tune.grid_search(range(0, 2)),
    "value3": tune.grid_search([True, False]),
}
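# Note: the 2 x 2 x 2 grid yields 8 trials; the 4 trials with value3=False
# take the else branch and raise the mock error.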
train_model = tune.with_resources(train_test, {"cpu": 1})

# Start the Tune run with max_concurrent_trials=2.
tuner = tune.Tuner(
    train_model,
    param_space=trial_space,
    run_config=air.RunConfig(local_dir="some_path", log_to_file=True),
    tune_config=tune.TuneConfig(num_samples=1, max_concurrent_trials=2),
)
results = tuner.fit()
- Cancel / pause the experiment after two trials have errored
- Restore the experiment
# restore the experiment
tuner = tune.Tuner.restore(
    path="some_path/experiment_run",
    trainable=train_model,
    restart_errored=True,
)
results = tuner.fit()
- I notice that 4 trials are queued and run at the same time, instead of at most 2 concurrent trials
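For now, the only workaround I can think of is to cap concurrency indirectly through per-trial resources instead of max_concurrent_trials. Below is a rough, untested sketch, assuming an 8-CPU machine so that at most 2 trials fit at once (I am not sure whether resources passed to the re-specified trainable override the ones recorded in the experiment checkpoint):

# Workaround sketch (untested): request 4 CPUs per trial so the scheduler
# can only fit 2 trials at a time on an 8-CPU machine, independent of
# max_concurrent_trials.
train_model = tune.with_resources(train_test, {"cpu": 4})

tuner = tune.Tuner.restore(
    path="some_path/experiment_run",
    trainable=train_model,
    restart_errored=True,
)
results = tuner.fit()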
Thank you!