Ray Tune PyTorch example

I am trying to run the PyTorch CIFAR training example with 10 trials on my Ray cluster on AWS. I get the following error at the end:

Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0012335650878237904, 'batch_size': 4}
Best trial final validation loss: 1.1528024486728012
Best trial final validation accuracy: 0.609
Traceback (most recent call last):
  File "test_ray_pytorch.py", line 269, in <module>
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
  File "test_ray_pytorch.py", line 233, in main
  File "test_ray_pytorch.py", line 169, in test_best_model
    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")
  File "/home/ray/anaconda3/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

So it seems that the checkpoint path (best_trial.checkpoint.value) is None. Is there something else I need to set up to ensure that this example runs?
It looks like checkpointing is not happening, but from the code I cannot tell where the checkpoint path is specified. Is this something Ray Tune takes care of under the hood?
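
For reference, here is a minimal sketch of how the failing line could be guarded so that a missing checkpoint gives a clear result instead of the TypeError above. It only illustrates the failure mode and does not explain why the checkpoint is missing. The helper name resolve_checkpoint_file is hypothetical and not part of Ray Tune; the file name "checkpoint" is taken from the traceback.

```python
import os
from typing import Optional


def resolve_checkpoint_file(checkpoint_dir: Optional[str],
                            filename: str = "checkpoint") -> Optional[str]:
    """Return the path to the checkpoint file, or None if it is unavailable.

    Hypothetical helper for illustration; not a Ray Tune API.
    """
    # os.path.join(None, "checkpoint") raises the TypeError shown in the
    # traceback, so handle the "no checkpoint recorded" case explicitly.
    if checkpoint_dir is None:
        return None
    path = os.path.join(checkpoint_dir, filename)
    # A directory may be recorded while the file itself is not present
    # on the machine running this code.
    return path if os.path.isfile(path) else None
```

In test_best_model this would be called as resolve_checkpoint_file(best_trial.checkpoint.value), and a None result would mean there is nothing to load on this machine.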

Also, on the head node under ray_results, in the directory created for each trial, I do not see any checkpoint file. I see the following:


BTW, are you running the cifar10_pytorch example out of the box, or did you modify anything?

We actually have this example running as part of our CI test. Looking at the dashboard, I don’t see an issue with this test.