Why does Ray Tune restart my trainings?

I am trying to train 1000 different networks, each with a different configuration that I have labelled with an integer.

What I want, at the current stage, is to simply perform an initial training (with the same number of epochs) on all of them.

Therefore, I run an analysis with the following parameters:

analysis = ray.tune.run(
    TrainerWrapper,
    name="initial_training",
    stop=stopper,
    config={
        "configuration": ray.tune.grid_search(models_to_be_trained),
        'lr': LR,
        'gradient_clip': GRADIENT_CLIP
    },
    num_samples=1,
    local_dir=RAY_CHECKPOINTS_DIR,
    reuse_actors=True,
    resume=True,
    checkpoint_at_end=True
)

where LR and GRADIENT_CLIP are constants, models_to_be_trained is a list of integers, and stopper is a simple function that stops a trial once it has run for enough epochs or when it starts to overfit:

def stopper(_, result):
    training_iteration = result["training_iteration"]
    test_linfty_error = result["test_average_linfty_error"]
    train_linfty_error = result["train_average_linfty_error"]
    if training_iteration > MAX_EPOCHS // EPOCHS_PER_STEP:
        return True

    if test_linfty_error > 1.5 * train_linfty_error:
        return True

    return False

I would expect this code to train each of my configurations exactly once (because I set num_samples=1) and to run 100 iterations per trial (because that is the value of MAX_EPOCHS // EPOCHS_PER_STEP). But what I see is that Ray restarts my trials when they reach 20 iterations. Indeed, this is what I see at some point in the logs:

(run pid=82760) == Status ==
(run pid=82760) Current time: 2022-06-19 18:25:27 (running for 01:12:07.44)
(run pid=82760) Memory usage on this node: 234.1/377.6 GiB
(run pid=82760) Using FIFO scheduling algorithm.
(run pid=82760) Resources requested: 256.0/512 CPUs, 15.360000000000039/16 GPUs, 0.0/1026.51 GiB heap, 0.0/443.92 GiB objects (0.0/4.0 accelerator_type:V100)
(run pid=82760) Result logdir: /home/spiani/ray_results/resnet_train_small_models
(run pid=82760) Number of trials: 336/1000 (80 PENDING, 256 RUNNING)
(run pid=82760) +----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) | Trial name                 | status   | loc             |   configuration |   iter |   total time (s) |   test_average_linfty_error |
(run pid=82760) |----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------|
(run pid=82760) | TrainerWrapper_ba6d9_00000 | RUNNING  | 10.1.14.1:37692 |             386 |     18 |          82982.5 |                    0.407178 |
(run pid=82760) | TrainerWrapper_ba6d9_00001 | RUNNING  | 10.1.14.1:37875 |             389 |     18 |          83199.6 |                    0.417423 |
(run pid=82760) | TrainerWrapper_ba6d9_00002 | RUNNING  | 10.1.14.1:37958 |             392 |     18 |          83383.8 |                    0.435176 |
(run pid=82760) | TrainerWrapper_ba6d9_00003 | RUNNING  | 10.1.14.1:37960 |             395 |     18 |          83757.9 |                    0.494883 |
(run pid=82760) | TrainerWrapper_ba6d9_00004 | RUNNING  | 10.1.14.1:37962 |             242 |     19 |          85124.9 |                    0.401442 |
(run pid=82760) | TrainerWrapper_ba6d9_00005 | RUNNING  | 10.1.14.1:37964 |             245 |     19 |          85538.3 |                    0.371104 |
(run pid=82760) | TrainerWrapper_ba6d9_00006 | RUNNING  | 10.1.14.1:37966 |             248 |     19 |          85764.5 |                    0.373599 |
(run pid=82760) | TrainerWrapper_ba6d9_00007 | RUNNING  | 10.1.14.1:37968 |             251 |     19 |          85316.9 |                    0.396243 |
(run pid=82760) | TrainerWrapper_ba6d9_00008 | RUNNING  | 10.1.14.1:37973 |             385 |     18 |          82935.5 |                    0.407906 |
(run pid=82760) | TrainerWrapper_ba6d9_00009 | RUNNING  | 10.1.14.1:37982 |             388 |     18 |          83003.1 |                    0.464052 |
(run pid=82760) | TrainerWrapper_ba6d9_00256 | PENDING  |                 |            2118 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00257 | PENDING  |                 |            2982 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00258 | PENDING  |                 |            2121 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00259 | PENDING  |                 |            2985 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00260 | PENDING  |                 |            3002 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00261 | PENDING  |                 |            3005 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00262 | PENDING  |                 |            1718 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00263 | PENDING  |                 |            3008 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00264 | PENDING  |                 |            1721 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00265 | PENDING  |                 |            3011 |        |                  |                             |
(run pid=82760) +----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) ... 316 more trials not shown (246 RUNNING, 70 PENDING)
(run pid=82760) 
(run pid=82760) 
(run pid=82760) Result for TrainerWrapper_ba6d9_00006:
(run pid=82760)   date: 2022-06-19_18-25-27
(run pid=82760)   done: false
(run pid=82760)   experiment_id: fdbd45174bd443d6954dc3727c87ae3e
(run pid=82760)   hostname: pn14-04
(run pid=82760)   iterations_since_restore: 1
(run pid=82760)   node_ip: 10.1.14.4
(run pid=82760)   pid: 103245
(run pid=82760)   test_average_l2_error: 7.358240948486328
(run pid=82760)   test_average_l2_relative_error: 0.20228974905014038
(run pid=82760)   test_average_linfty_error: 0.5165811861673991
(run pid=82760)   test_average_linfty_relative_error: 0.24805314655303956
(run pid=82760)   test_max_l2_error: 56.37067794799805
(run pid=82760)   test_max_l2_relative_error: 1.1429545879364014
(run pid=82760)   test_max_linfty_error: 4.586096286773682
(run pid=82760)   test_max_linfty_relative_error: 1.3744010925292969
(run pid=82760)   time_since_restore: 4241.323424339294
(run pid=82760)   time_this_iter_s: 4241.323424339294
(run pid=82760)   time_total_s: 4241.323424339294
(run pid=82760)   timestamp: 1655655927
(run pid=82760)   timesteps_since_restore: 0
(run pid=82760)   train_average_l2_error: 7.390049274359809
(run pid=82760)   train_average_l2_relative_error: 0.20256807630327012
(run pid=82760)   train_average_linfty_error: 0.5167996492173936
(run pid=82760)   train_average_linfty_relative_error: 0.24799219224717883
(run pid=82760)   train_max_l2_error: 54.83803939819336
(run pid=82760)   train_max_l2_relative_error: 1.1142292022705078
(run pid=82760)   train_max_linfty_error: 4.486876487731934
(run pid=82760)   train_max_linfty_relative_error: 1.2893394231796265
(run pid=82760)   training_iteration: 1
(run pid=82760)   trial_id: ba6d9_00006
(run pid=82760)   
(run pid=82760) == Status ==
(run pid=82760) Current time: 2022-06-19 18:25:32 (running for 01:12:12.71)
(run pid=82760) Memory usage on this node: 234.2/377.6 GiB
(run pid=82760) Using FIFO scheduling algorithm.
(run pid=82760) Resources requested: 256.0/512 CPUs, 15.360000000000039/16 GPUs, 0.0/1026.51 GiB heap, 0.0/443.92 GiB objects (0.0/4.0 accelerator_type:V100)
(run pid=82760) Result logdir: /home/spiani/ray_results/resnet_train_small_models
(run pid=82760) Number of trials: 336/1000 (80 PENDING, 256 RUNNING)
(run pid=82760) +----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) | Trial name                 | status   | loc              |   configuration |   iter |   total time (s) |   test_average_linfty_error |
(run pid=82760) |----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------|
(run pid=82760) | TrainerWrapper_ba6d9_00000 | RUNNING  | 10.1.14.1:37692  |             386 |     18 |         82982.5  |                    0.407178 |
(run pid=82760) | TrainerWrapper_ba6d9_00001 | RUNNING  | 10.1.14.1:37875  |             389 |     18 |         83199.6  |                    0.417423 |
(run pid=82760) | TrainerWrapper_ba6d9_00002 | RUNNING  | 10.1.14.1:37958  |             392 |     18 |         83383.8  |                    0.435176 |
(run pid=82760) | TrainerWrapper_ba6d9_00003 | RUNNING  | 10.1.14.1:37960  |             395 |     18 |         83757.9  |                    0.494883 |
(run pid=82760) | TrainerWrapper_ba6d9_00004 | RUNNING  | 10.1.14.1:37962  |             242 |     19 |         85124.9  |                    0.401442 |
(run pid=82760) | TrainerWrapper_ba6d9_00005 | RUNNING  | 10.1.14.1:37964  |             245 |     19 |         85538.3  |                    0.371104 |
(run pid=82760) | TrainerWrapper_ba6d9_00006 | RUNNING  | 10.1.14.4:103245 |             248 |      1 |          4241.32 |                    0.516581 |
(run pid=82760) | TrainerWrapper_ba6d9_00007 | RUNNING  | 10.1.14.1:37968  |             251 |     19 |         85316.9  |                    0.396243 |
(run pid=82760) | TrainerWrapper_ba6d9_00008 | RUNNING  | 10.1.14.1:37973  |             385 |     18 |         82935.5  |                    0.407906 |
(run pid=82760) | TrainerWrapper_ba6d9_00009 | RUNNING  | 10.1.14.1:37982  |             388 |     18 |         83003.1  |                    0.464052 |
(run pid=82760) | TrainerWrapper_ba6d9_00256 | PENDING  |                  |            2118 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00257 | PENDING  |                  |            2982 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00258 | PENDING  |                  |            2121 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00259 | PENDING  |                  |            2985 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00260 | PENDING  |                  |            3002 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00261 | PENDING  |                  |            3005 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00262 | PENDING  |                  |            1718 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00263 | PENDING  |                  |            3008 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00264 | PENDING  |                  |            1721 |        |                  |                             |
(run pid=82760) | TrainerWrapper_ba6d9_00265 | PENDING  |                  |            3011 |        |                  |                             |
(run pid=82760) +----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) ... 316 more trials not shown (246 RUNNING, 70 PENDING)
(run pid=82760) 

If you look at trial number 6 (configuration 248 in my code), you will see that, at some point, it goes from step 19 back to step 1 and is moved to another machine. Why does this happen? How can I prevent it? I am running Ray 1.11, but I have tried several different versions and I always see the same behaviour.

Thank you!

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Some other details that I forgot to mention:

  • It always happens at iteration 20 (so after 19 I see 1 instead of 20)
  • It happens for all the samples

Hey @spiani, are there any additional logs printed for the Trial before it restarts?

Also, would it be possible to reproduce this in a smaller setting? For example, subbing out the training code to just report the same values repeatedly to verify that the stopper stops at 100 iterations as expected.

Looking over this, one thing that stands out is that the timing of the restart is incredibly close to exactly 24 hours. Not sure if this is a red herring, but could it be possible that the nodes that this is running on are terminated after 24 hours?

Unfortunately, there aren't any additional logs: Ray seems to continue its execution without any warning or error; it just restarts the training.

I did a test, replacing my current "step" method with the following:

  def step(self):
      # "sleep" is time.sleep, i.e. this assumes "from time import sleep" at module level
      sleep(1)
      return {
          'test_average_linfty_error': 1,
          'train_average_linfty_error': 1
      }

Everything else was the same, especially the reset_config and the load/save_checkpoint mechanism. In this case, everything seems to work fine (and it stops at step 101). Therefore, the problem seems to be related to the execution time. I will run this test again, replacing sleep(1) with sleep(3600) and then with sleep(7200): this way, I can see whether it restarts after a fixed delay or after a fixed number of steps. Unfortunately, this test will take at least one day, so I will give you the results tomorrow.

As for the execution time, it is true that I am running this test on a cluster using a job with 24 hours of allocated time; but the first job of the test already stopped and I resumed the experiment from another job. When I resumed it, the samples were around step 16, so maybe this is not correlated with the problem.

I will let you know the results of my test with the sleep. Thank you!

Hey @spiani were you able to run the job with the sleep timer?

Yep, and I realized that the problem is not related to the time or to the number of iterations, but to the fact that I restart the Ray executions. Let me give some more details about my setup.
Currently, I am running on a cluster with Slurm. The longest time I can request for a job is 24 hours, so I am forced to split my training into multiple executions. In each job I run Ray and train several different networks simultaneously until the job runs out of time. Then I simply start a new job, passing resume=True to ray.tune.run.
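
Concretely, every job launches essentially the same driver script; only the resume flag differs. A stripped-down sketch of that pattern (the helper function and the directory check are illustrative, not my exact code):

import os

from ray import tune


def launch_or_resume(trainable, experiment_name, checkpoints_dir, config):
    # Run the same experiment from any Slurm job.
    # Resume only if a previous job already created the experiment directory.
    resume = os.path.isdir(os.path.join(checkpoints_dir, experiment_name))
    return tune.run(
        trainable,
        name=experiment_name,
        config=config,
        local_dir=checkpoints_dir,
        resume=resume,
        checkpoint_at_end=True,
    )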

When I run the first job, everything works fine. When I run the second one, Ray is aware of the results obtained in the previous job, but, for any given sample, as soon as the first new iteration finishes, that sample is reset to iteration 1. And this happens for all the samples. I had not noticed this at first because it takes a while for an iteration to finish and I am training many samples (many more than the ones printed in the table), so I had the false impression that the iterations were increasing normally at the beginning of the second job; in fact, the table was simply showing different samples.

Is there a log where I can read what happens exactly when a sample returns the results of one iteration? Thank you!

Ah gotcha, thanks for the update here.

It seems like what's happening is that you have set up experiment-level resume logic (Stopping and Resuming Tune Trials — Ray 1.13.0), but might be missing the trial-level resume logic.

In other words, when you stop the job in the first allocation and resume it in the second, Tune knows the experiment state (e.g. you're running 10 Trials and they're all on epoch 16). However, in order to resume a Trial from epoch 16, you actually need to set up checkpointing logic so that your Trial can save/restore the proper state. See A Guide To Using Checkpoints — Ray 1.13.0 for details on this.
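
With the class-based Trainable API, that logic could look roughly like the sketch below (assuming a PyTorch model; the self.model and self.optimizer attributes are placeholders, not taken from your actual TrainerWrapper):

import os

import torch
from ray import tune


class TrainerWrapper(tune.Trainable):
    # setup() and step() omitted; assume they create self.model and
    # self.optimizer and report the metrics your stopper uses.

    def save_checkpoint(self, tmp_checkpoint_dir):
        # Called by Tune whenever a checkpoint is taken
        # (checkpoint_freq, checkpoint_at_end, ...).
        path = os.path.join(tmp_checkpoint_dir, "model.pt")
        torch.save(
            {
                "model": self.model.state_dict(),
                "optimizer": self.optimizer.state_dict(),
            },
            path,
        )
        return tmp_checkpoint_dir

    def load_checkpoint(self, checkpoint_dir):
        # Called by Tune when the trial is restored, e.g. after resume=True.
        state = torch.load(os.path.join(checkpoint_dir, "model.pt"))
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])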

Is there a log where I can read what happens exactly when a sample returns the results of one iteration? Thank you!

You can find the log files in your result directory (e.g. /home/spiani/ray_results/resnet_train_small_models). Each Trial has its own directory, and in it progress is recorded in progress.csv and result.json.
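
If you want to inspect them programmatically, something like this sketch works (the experiment path is just the logdir from your status output):

import glob
import os

import pandas as pd

# Point this at your own experiment directory.
experiment_dir = os.path.expanduser("~/ray_results/resnet_train_small_models")

for progress_file in sorted(glob.glob(os.path.join(experiment_dir, "*", "progress.csv"))):
    trial_name = os.path.basename(os.path.dirname(progress_file))
    df = pd.read_csv(progress_file)
    # training_iteration shows whether the trial kept counting up
    # or dropped back to 1 after a restore.
    print(trial_name, df["training_iteration"].tolist())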

Ok, I think I solved my problem. Everything seems to be working fine now simply because I have added the option
checkpoint_freq=n
(where n can be any positive integer) to the ray.tune.run call.

In my old code, I never specified when to perform a checkpoint and, therefore, Ray never created any. When I stopped the execution, Ray printed the following line:

WARNING tune.py:643 -- Experiment has been interrupted, but the most recent state was saved. You can continue running this experiment by passing `resume=True` to `tune.run()`

and this led me to think that Ray had automatically created a checkpoint when the execution was stopped. This does not seem to be the case, so I have to request the checkpoints explicitly.
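
For completeness, the updated call looks like this (the checkpoint_freq value shown here is just an example):

analysis = ray.tune.run(
    TrainerWrapper,
    name="initial_training",
    stop=stopper,
    config={
        "configuration": ray.tune.grid_search(models_to_be_trained),
        'lr': LR,
        'gradient_clip': GRADIENT_CLIP
    },
    num_samples=1,
    local_dir=RAY_CHECKPOINTS_DIR,
    reuse_actors=True,
    resume=True,
    checkpoint_freq=1,      # save a trial-level checkpoint every iteration
    checkpoint_at_end=True
)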

Thank you for your help!