I am trying to run the PyTorch CIFAR training example with 10 trials on my Ray cluster on AWS. At the end of the run I get the following error:
Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0012335650878237904, 'batch_size': 4}
Best trial final validation loss: 1.1528024486728012
Best trial final validation accuracy: 0.609
Traceback (most recent call last):
  File "test_ray_pytorch.py", line 269, in <module>
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
  File "test_ray_pytorch.py", line 233, in main
    test_best_model(best_trial)
  File "test_ray_pytorch.py", line 169, in test_best_model
    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")
  File "/home/ray/anaconda3/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType
So it seems that the checkpoint value (best_trial.checkpoint.value) is None. Is there something else I need to set up to ensure that this example runs?
It looks like no checkpointing is being done, but from the code I cannot tell where the checkpoint path is specified. Is this something that Ray Tune takes care of under the hood?
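To make the failure concrete, here is a minimal sketch that needs only the standard library (it is not the tutorial's code). It reproduces the TypeError by passing None to os.path.join, and shows a guard that would fail with a clearer message. The helper name checkpoint_file is hypothetical, and its argument stands in for best_trial.checkpoint.value.

```python
import os
from typing import Optional


def checkpoint_file(checkpoint_value: Optional[str]) -> str:
    """Return the path of the "checkpoint" file inside a checkpoint directory.

    `checkpoint_value` is a stand-in for `best_trial.checkpoint.value`
    (assumption: it is either a directory path or None).
    """
    if checkpoint_value is None:
        # Fail early with a message that points at the real problem.
        raise RuntimeError(
            "The best trial has no checkpoint; nothing appears to have been "
            "saved during training."
        )
    return os.path.join(checkpoint_value, "checkpoint")


if __name__ == "__main__":
    # Same failure mode as in the traceback: os.path.join rejects None.
    try:
        os.path.join(None, "checkpoint")
    except TypeError as exc:
        print(f"TypeError: {exc}")
```

With a guard like this in test_best_model, the script would at least stop with an explicit message instead of the TypeError, although it would not explain why no checkpoint was written.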
Also, on the head node, when I look under ray_results at the directory created for each trial, I do not see any checkpoint files. I see the following:
events.out.tfevents.1645561736.test-ray-cluster-ray-head-type-xndrw
params.json
params.pkl
progress.csv
result.json
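For anyone who wants to check their own trial directories the same way, this is a small sketch of the check I did by hand. It assumes that checkpoints, when they are written, show up as subdirectories whose names start with "checkpoint" inside each trial directory. That naming is my assumption and may differ between Ray versions. The function name find_checkpoint_dirs is hypothetical.

```python
import os
from typing import List


def find_checkpoint_dirs(trial_dir: str) -> List[str]:
    """List subdirectories of `trial_dir` whose names start with "checkpoint".

    Assumption: Ray Tune stores each checkpoint in such a subdirectory.
    """
    found = []
    for entry in sorted(os.listdir(trial_dir)):
        full_path = os.path.join(trial_dir, entry)
        if os.path.isdir(full_path) and entry.startswith("checkpoint"):
            found.append(full_path)
    return found
```

Calling it on one of my trial directories would return an empty list, since the directory contains only the five files listed above.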