Matching ray-tune validation with training validation

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I often see that the optimal “obj” reached by ray-tune does not match the obj I see when I train with the “optimal” parameters, even after matching random seeds. If I understand correctly, there can be various reasons for this discrepancy, the main one probably being that Ray kills trials early, so the “obj” it reports is only an estimate of the “obj” I may eventually get from a normal training run.

But to rule out any other issues, I would like to perform a sanity-check run with appropriate settings in Ray-tune, so that the optimal “obj” exactly matches the validation objective seen during a normal training run (e.g. “val_loss”). I imagine there are settings I can use in the scheduler and/or search algorithm. Any suggestions?

Hi @pchalasani, can you give us more context about the function you’re training and the objective you want to reach? It’s hard to reason about this in an abstract manner. It would also be helpful if you could provide your code with which you don’t currently achieve the expected results, the results you’d want to see, and maybe some context on how you came to these results.

Thanks!

Thanks @kai for taking a look at this. Some context: I am using Pytorch Lightning, and closely followed the ray/tune docs on how to use it with PTL, with ASHAScheduler and TuneReportCallback, with these params:

    tune_callback = TuneReportCallback(
        {"obj": "val_auc"},
        on="validation_end"
    )
    tune_scheduler = ASHAScheduler(
        max_t=80,
        grace_period=60,
        reduction_factor=2
    )
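
For reference, the wiring roughly follows the Ray/Tune + PTL tutorial; the module, datamodule, and search-space values below are placeholders rather than my actual code:

    from pytorch_lightning import Trainer
    from ray import tune

    def train_fn(config):
        # Build the LightningModule from the sampled hyperparameters
        # (MyLightningModule / MyDataModule stand in for my real classes).
        model = MyLightningModule(**config)
        trainer = Trainer(
            max_epochs=80,
            callbacks=[tune_callback],  # reports "obj" (val_auc) to Tune at validation_end
        )
        trainer.fit(model, datamodule=MyDataModule())

    analysis = tune.run(
        train_fn,
        config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
        metric="obj",
        mode="max",
        scheduler=tune_scheduler,
        num_samples=20,
    )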

The full setup is too complex to share here, but one detail that is probably relevant to this discussion is that the original trainer has other callbacks such as ModelCheckpoint and EarlyStopping.
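
Roughly, those callbacks look like this (the monitor/patience values are illustrative, not my real settings):

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

    early_stop = EarlyStopping(monitor="val_auc", mode="max", patience=10)
    checkpoint = ModelCheckpoint(monitor="val_auc", mode="max", save_top_k=1)

    trainer = Trainer(
        max_epochs=80,
        callbacks=[early_stop, checkpoint],
    )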

The issue is that the final best “obj” (val_auc in this case) found by Ray/Tune is sometimes far from the val_auc I see when I re-run training with the best settings found by Ray/Tune, using the exact same random seed and callbacks (early stopping, checkpoints).

I wanted to understand:

  • in what scenarios could this discrepancy occur? Is it affected by early stopping callbacks?
  • is the objective value found by Ray/Tune based on a “full training run”, or is it just some type of estimate? [This one is probably easy to answer; it’s just a gap in my understanding of how Ray/Tune works]
  • are there any settings I can use to sanity-check my setup, to ensure that the Ray/Tune optimal obj exactly matches the obj from a normal training run using the best hparams found? For example, can I use large values of max_t and grace_period in the ASHAScheduler, and eliminate early stopping in the original trainer (to remove any effects due to that callback)? See the sketch after this list for what I have in mind.
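
To make the last point concrete, this is the kind of sanity-check configuration I have in mind; whether it actually guarantees that every trial runs to completion is part of what I’m asking (the numbers are placeholders):

    from pytorch_lightning import Trainer
    from ray.tune.schedulers import ASHAScheduler

    # If I understand ASHA correctly, grace_period == max_t means no trial is
    # stopped before max_t, so the last reported "obj" should come from a
    # full-length training run.
    sanity_scheduler = ASHAScheduler(
        max_t=80,
        grace_period=80,
        reduction_factor=2,
    )

    # And in the training function, drop EarlyStopping/ModelCheckpoint so the
    # Tune run and the stand-alone run follow identical training schedules
    # (tune_callback is the TuneReportCallback from above).
    trainer = Trainer(max_epochs=80, callbacks=[tune_callback])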

@kai I just posted another question, which may have some bearing on this.

Thanks for the follow-up question! Just for others who run into this question: the main reason I suspect, after reading your follow-up, is that Ray Tune’s final result usually only considers the last reported result, not the best result seen during the run. This can be specified in calls to the experiment analysis methods, and alternatively you can always fetch the full trial result dataframes. Other than that, runs should usually be reproducible, and we have tests in place that verify this.
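
For example, with the ExperimentAnalysis object returned by tune.run, something along these lines lets you compare the last-reported metric with the best metric ever reported (exact arguments may vary a bit between Ray versions):

    # `analysis` is the ExperimentAnalysis returned by tune.run(...)

    # Default: the best trial is judged by the *last* reported "obj"
    best_last = analysis.get_best_trial("obj", mode="max", scope="last")

    # Alternative: judge by the best "obj" reported at any point in the run
    best_all = analysis.get_best_trial("obj", mode="max", scope="all")

    # Or fetch the full per-trial result history and inspect it yourself
    for logdir, df in analysis.trial_dataframes.items():
        print(logdir, df["obj"].max())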

Let’s continue the discussion about last/best results in the other thread, and if there are further problems with reproducibility, we can pick those up here.

@kai I just posted another question relevant to this, with a self-contained minimal example.