How to share variables between tune Trials

Hi!

I’ve been using ray.tune for a while to run multiple machine learning experiments, each of which saves its model at the end of training. Now, I usually pick the best model at the end of all experiments using analysis.get_best_trial. The issue with this is that my disk space is limited, and I would like to avoid saving the model of a bad experiment at all (even a single checkpoint). My first thought was to have a lock and an int value that I can use to keep track of whether I should save the current run.
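For reference, this is roughly how I currently pick the best trial (the metric name, mode, and search_space below are placeholders for my actual setup):

from ray import tune

# Sketch of my current selection step; "accuracy"/"max" and search_space are placeholders.
analysis = tune.run(tune_run_train, config=search_space, metric="accuracy", mode="max")
best_trial = analysis.get_best_trial(metric="accuracy", mode="max")
print(best_trial.config, best_trial.last_result["accuracy"])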
A simplified extract from my code looks like this:

from multiprocessing import Lock

top_value = 0
lock = Lock()

# ...

def tune_run_train(conf):
    """
    Function passed to tune.run
    """
    global top_value
    global lock

    # training and validation
    experiment_value = evaluation_function()
    to_save = False

    with lock:
        # only one trial at a time should be able to update top_value
        if experiment_value > top_value:
            to_save = True
            top_value = experiment_value

    if to_save:
        # save model...
        pass

However, this code does not work: each Trial ends up with its own copy of the lock (and of top_value) instead of sharing them.

Is there, then, a better solution or a workaround?

Thank you in advance for your time.

Edit: Someone had a similar problem here; however, there was no answer.

Found a solution by myself, posting it here for people encountering the same problem.

Basically, I followed the code snippet here that employs a FileLock to synchronize the processes.
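In short, FileLock (from the filelock package) takes an exclusive lock backed by a file on disk, so separate trial processes block each other around the critical section. A minimal sketch of the pattern (the path is just an example):

from filelock import FileLock

# Only one process at a time can enter this block; the others wait on the lock file.
with FileLock('../file.lock'):
    # critical section: read and update the shared file here
    ...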

Below I post the code I used to avoid saving every single model:

import os

import numpy as np
from filelock import FileLock

run_metric = metrics_values[OPTIMIZING_METRIC]
to_save = False

with FileLock('../file.lock'):
    # The npz file holds the metrics of the top-3 models saved so far.
    file_path = '../top_values_tmp.npz'
    if not os.path.isfile(file_path):
        np.savez(file_path, tops=np.zeros(3))

    with np.load(file_path) as array_dict:
        top_values = array_dict['tops']
        argmin = np.argmin(top_values)
        if top_values[argmin] < run_metric:
            print(f'Run saved! - {top_values} , {run_metric}')
            to_save = True
            top_values[argmin] = run_metric
            np.savez(file_path, tops=top_values)
        else:
            print(f'Run not saved! - {top_values} , {run_metric}')

# Save the model only if this run made it into the top-3
if to_save:
    ...

I simply store an array with the metrics of the current top-3 best-performing models and save the current model only if its metric beats the worst of those three. The solution works well in my case, but of course it still ends up saving more than three models whenever consecutive trials keep improving the metric, since previously saved checkpoints are not deleted.

This is not currently supported in Tune. We could potentially add something to the Tune API to enforce checkpointing only the best trial (and maintain that as more trials finish).
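For completeness, tune.run already has a per-trial knob (if I remember the signature correctly): keep_checkpoints_num together with checkpoint_score_attr keeps only the best checkpoints within each trial, but it does not do the cross-trial selection discussed here:

from ray import tune

# Per-trial checkpoint pruning only: keeps the single best checkpoint of each
# trial, ranked by "accuracy". Cross-trial pruning still needs a workaround
# like the FileLock approach above. (Parameter names as I recall them.)
analysis = tune.run(
    tune_run_train,
    keep_checkpoints_num=1,
    checkpoint_score_attr="accuracy",
)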

Glad that you found something that works for your case. One thing I want to point out is that the FileLock method probably only works for the single-machine case? For trials distributed across multiple machines, it is not likely to work.
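For what it's worth, one pattern that should also work across machines is to keep the shared state in a named Ray actor instead of a file on disk. A rough sketch (names here are illustrative, not from the snippet above):

import ray

@ray.remote
class BestValueTracker:
    """Holds the best metric seen so far, in a single process on the cluster."""

    def __init__(self):
        self.top_value = float("-inf")

    def should_save(self, value):
        # Actor methods run one at a time, so this check-and-update is atomic.
        if value > self.top_value:
            self.top_value = value
            return True
        return False

# Before tune.run(): create a named, detached actor that every trial can look up.
BestValueTracker.options(name="best_value_tracker", lifetime="detached").remote()

# Inside the trainable:
def tune_run_train(conf):
    experiment_value = evaluation_function()
    tracker = ray.get_actor("best_value_tracker")
    if ray.get(tracker.should_save.remote(experiment_value)):
        pass  # save model...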