PyTorch Tutorial understanding

Hi Everyone,

I hope this isn’t too much of a newbie question. I am new to AI and PyTorch.

I have followed the tutorial here:
How to use Tune with PyTorch — Ray 2.0.1

I am a little confused with one part, maximum epochs.

In the code “train_cifar” we see it has:

for epoch in range(10):

and in the main function it says

def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):

with the ASHAScheduler using this for max_t=max_num_epochs

My question is, what is the difference between these two epoch counters?
I get the one in train_cifar would be how many epochs we use to train the model, but then what does this max_num_epochs in the main function do? Should they just be the same?

I thought that maybe the one in main might rerun the train_cifar function max_num_epochs times until train_cifar converged.

Thank you for your help! I am sure this is quite trivial for most of you, maybe one day it will be for me too!

Hi!
max_num_epochs is passed to ASHAScheduler’s initializer as max_t: Trial Schedulers (tune.schedulers) — Ray 2.1.0

It means: max time units per trial. Trials will be stopped after max_t time units (determined by time_attr) have passed.

So if you set this higher than 10, it probably doesn’t make any difference, as the training function only iterates 10 times (for epoch in range(10)). If you set it lower than 10, each trial will be stopped before it finishes all 10 epochs, which probably is not what you want. By default, max_t is 100, so feel free to leave it as it is if your epoch count is only 10.
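
To make the interaction concrete, here is a toy sketch in plain Python. It is not Ray's implementation and does not import Ray; the function name epochs_actually_run and its parameters are made up for illustration. It only models the cap: the loop inside the training function sets an upper bound, and max_t can cut a trial short before that bound. Keep in mind that the real ASHAScheduler can also stop poorly performing trials even earlier than max_t, which this sketch ignores.

```python
def epochs_actually_run(loop_epochs, max_t):
    """Toy model of how many epochs a single trial completes.

    loop_epochs mimics `for epoch in range(loop_epochs)` in train_cifar.
    max_t mimics the scheduler's cap on reported iterations per trial.
    Hypothetical helper for illustration only, not part of Ray.
    """
    completed = 0
    for _ in range(loop_epochs):
        completed += 1          # one epoch trained, one result reported
        if completed >= max_t:  # the cap is reached, the trial is stopped
            break
    return completed


if __name__ == "__main__":
    for max_t in (5, 10, 100):
        print(f"max_t={max_t}: {epochs_actually_run(10, max_t)} epochs run")
```

In this toy model the number of epochs a trial runs is min(loop_epochs, max_t), which is why setting the two values equal is the simplest choice.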


Ah thank you for this explanation!

It sounds like they should probably just be equal.