How to use time_total_s as a stop condition?

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi,

Thank you for an easy-to-use Ray Tune. I am new to Ray Tune and I am trying to use it to tune the parameter values of an iterative non-learnable method. The method has lots of parameters, so it will take a while to optimize all of them. However, the machine on which I would like to run Ray Tune automatically terminates a program after it has run for 72 hours. I would like Ray Tune to terminate properly a few minutes before it hits 72 hours. I have done the following:

from functools import partial

from ray import tune
from ray.tune.schedulers import ASHAScheduler

result = tune.run(
    tune.with_parameters(partial(optimize, args, relative_data_paths)),
    name=args.tuning_exp_name,
    resources_per_trial={"cpu": args.n_cpus, "gpu": args.n_gpus},
    config=config,
    stop={"time_total_s": args.stop_time_total_h * 3600},  # hours -> seconds
    num_samples=args.n_samples,  # number of trials
    scheduler=ASHAScheduler(),
    metric="score",
    mode=args.tuning_mode,
    fail_fast=True,  # stop the entire Tune run as soon as any trial errors
    log_to_file=True,  # save stdout and stderr to trial_logdir/stdout and trial_logdir/stderr
)

where args holds the command-line inputs and args.stop_time_total_h is in hours. To check whether the optimization stops after a certain time, I tested with args.stop_time_total_h=0.05, which is 3 minutes. It seemed that Ray Tune ran all the trials regardless of stop={'time_total_s':args.stop_time_total_h*3600}.

Could anyone tell me whether I did something wrong?

Hi @sreaung,

are you reporting intermediate results to Ray Tune using tune.report()?

Because of the way the tuning loop is implemented, the stopping conditions are only checked when a new result is received. Each result automatically contains time_total_s, and the trial is stopped if the conditions are met. But if you don’t report anything until the very end, Tune has no information to act on and cannot stop the trial early.
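To illustrate the point about reporting, here is a toy, Ray-free sketch of a loop that checks stop criteria only when a result arrives. The names `report_every_iteration`, `report_once_at_end` and `run_trial` are hypothetical and are not Ray APIs, and time is simulated rather than measured; it is only meant to mirror the behaviour described above, not Tune's actual implementation.

```python
from typing import Dict, Iterator

Result = Dict[str, float]


def report_every_iteration(n_iters: int, secs_per_iter: float) -> Iterator[Result]:
    """Toy trainable that reports after each iteration
    (analogous to calling tune.report() inside the loop)."""
    elapsed = 0.0
    for i in range(n_iters):
        elapsed += secs_per_iter  # pretend one iteration of work took this long
        yield {"training_iteration": i + 1, "time_total_s": elapsed}


def report_once_at_end(n_iters: int, secs_per_iter: float) -> Iterator[Result]:
    """Toy trainable that does all of its work first and reports a single result."""
    elapsed = n_iters * secs_per_iter
    yield {"training_iteration": 1, "time_total_s": elapsed}


def run_trial(results: Iterator[Result], stop: Dict[str, float]) -> Result:
    """Consume results; stop criteria are evaluated only when a result arrives."""
    last: Result = {}
    for last in results:
        # Stop once any reported metric reaches or exceeds its threshold.
        if any(last[key] >= threshold for key, threshold in stop.items()):
            break
    return last


stop = {"time_total_s": 180.0}  # 0.05 h, as in the test above
early = run_trial(report_every_iteration(100, 10.0), stop)
late = run_trial(report_once_at_end(100, 10.0), stop)

print(early["time_total_s"])  # 180.0: stopped as soon as the limit was reached
print(late["time_total_s"])   # 1000.0: the only result arrived after all the work
```

In this sketch the trainable that reports every iteration is cut off at the limit, while the one that reports only at the end overshoots it, because there was no earlier result to check.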

Another reason why you would want to do that is that otherwise you won’t have any results to analyze within Ray Tune or the experiment checkpoint - after all, Tune received no metrics.

As a side note, the stop condition you specified applies per trial. So if a trial only starts, say, 40 hours in, it can run for up to another 72 hours, which makes a total experiment runtime of 112 hours. I think you might be looking for the time_budget_s parameter, tune.run(time_budget_s=xxx), which will stop the whole experiment after xxx seconds.
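The arithmetic above, and one way the budget might be derived from the 72-hour limit, can be sketched as follows. The helper names and the 10-minute safety margin are assumptions for illustration (the question only says "a few minutes"); the only Ray call used is tune.run with its time_budget_s argument, as mentioned above.

```python
def time_budget_s(wall_limit_h: float, margin_min: float) -> float:
    """Seconds available to the whole experiment, leaving a safety margin."""
    return wall_limit_h * 3600 - margin_min * 60


def worst_case_runtime_h(last_trial_start_h: float, per_trial_limit_h: float) -> float:
    """With a per-trial time_total_s limit, a trial that starts late can still
    use its full limit, so the experiment may run this long in total."""
    return last_trial_start_h + per_trial_limit_h


def run_with_budget(trainable, config, wall_limit_h=72.0, margin_min=10.0):
    """Hypothetical wrapper: run the experiment under an experiment-wide budget."""
    from ray import tune  # imported lazily so the helpers above work without Ray

    return tune.run(
        trainable,
        config=config,
        time_budget_s=time_budget_s(wall_limit_h, margin_min),
    )


print(worst_case_runtime_h(40, 72))  # 112
print(time_budget_s(72, 10))         # 258600
```

A per-trial stop condition and time_budget_s can also be combined, so that individual trials are capped while the experiment as a whole still ends before the machine's limit.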


Thanks so much, Kai! Your explanation is very helpful. I did report metric values using tune.report(), but now I know why the program did not terminate. As you mentioned, I should use tune.run(time_budget_s=xxx). Thanks so much again for your kind help and for the wonderful library!