Ray tune pytorch example

pamparana · February 22, 2022, 9:15pm

I am trying to run the PyTorch CIFAR training example with 10 trials on my Ray cluster on AWS. All trials finish, but the script fails at the end with the following error:

Best trial config: {'l1': 64, 'l2': 32, 'lr': 0.0012335650878237904, 'batch_size': 4}
Best trial final validation loss: 1.1528024486728012
Best trial final validation accuracy: 0.609
Traceback (most recent call last):
  File "test_ray_pytorch.py", line 269, in <module>
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)
  File "test_ray_pytorch.py", line 233, in main
    test_best_model(best_trial)
  File "test_ray_pytorch.py", line 169, in test_best_model
    checkpoint_path = os.path.join(best_trial.checkpoint.value, "checkpoint")
  File "/home/ray/anaconda3/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

So it seems that `best_trial.checkpoint.value` is None. Is there something else I need to set up to ensure that this example runs?
It looks like no checkpointing is happening, but from the code I cannot tell where the checkpoint path is specified. Is this something Ray Tune takes care of under the hood?
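For anyone hitting the same traceback: the TypeError comes from `os.path.join` receiving None, not from Ray itself. A minimal sketch of a guard that fails with a clearer message is below. The helper name `resolve_checkpoint_file` is hypothetical and not part of Ray Tune or the example script; it only assumes the checkpoint directory arrives as a string or None, as in the traceback above.

```python
import os
from typing import Optional


def resolve_checkpoint_file(checkpoint_dir: Optional[str],
                            filename: str = "checkpoint") -> str:
    """Build the checkpoint file path, or raise a readable error.

    Hypothetical helper: `checkpoint_dir` stands in for
    `best_trial.checkpoint.value` from the traceback.
    """
    if checkpoint_dir is None:
        # os.path.join(None, ...) would raise the opaque TypeError seen above.
        raise RuntimeError(
            "The best trial has no checkpoint directory; "
            "no checkpoint appears to have been recorded for it."
        )
    return os.path.join(checkpoint_dir, filename)
```

In `test_best_model`, this would be called as `resolve_checkpoint_file(best_trial.checkpoint.value)`. It does not fix the missing checkpoint; it only makes the failure easier to read.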

Also, on the head node, the directory created under ray_results for each trial does not contain any checkpoint files. I see only the following:

events.out.tfevents.1645561736.test-ray-cluster-ray-head-type-xndrw  
params.json  
params.pkl  
progress.csv  
result.json
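To run the same check across many trial directories, something like the sketch below may help. It assumes that checkpoints, when written, show up as entries whose names start with `checkpoint` inside the trial directory. That prefix is an assumption about the on-disk layout, and the function name `find_checkpoint_entries` is hypothetical.

```python
import os
from typing import List


def find_checkpoint_entries(trial_dir: str,
                            prefix: str = "checkpoint") -> List[str]:
    """Return sorted names of entries in trial_dir that start with prefix.

    The "checkpoint" prefix is an assumed naming convention; adjust it
    if your Ray version lays out checkpoints differently.
    """
    with os.scandir(trial_dir) as entries:
        return sorted(e.name for e in entries if e.name.startswith(prefix))
```

An empty list for a trial directory matches what is listed above: result and parameter files, but nothing that looks like a checkpoint.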

xwjiang2010 · February 23, 2022, 7:58pm

BTW, are you running the cifar10_pytorch example out of the box, or did you modify anything?

We actually run this example as part of our CI tests. Looking at the dashboard, I don’t see an issue with this test.
