Ray Tune + PyTorch: Cannot find model .pth in my experiment folder

  • Medium: It adds significant difficulty to completing my task, but I can work around it.

It stops me from loading the best model's parameters for testing. I'm not sure yet how I can get around it easily.

I have implemented a Ray Tune trainable and hyperparameter tuning in a Colab notebook (Ray version 1.12.0). It all seemed to work fine, except that in the experiment folder I can find files, but not the .pth file I expected based on the PyTorch examples in the documentation (e.g. CIFAR).

Here’s tune.run:

from ray import tune
from ray.tune.schedulers import ASHAScheduler

analysis = tune.run(
    trainKarateClub,
    num_samples=12,
    scheduler=ASHAScheduler(metric="mean_accuracy", mode="max"),
    config=search_space
)

Here’s the code I run after tune.run has completed:

import os

import torch

best_logdir = analysis.get_best_logdir('mean_accuracy', 'max')  # Get best trial's logdir
state_dict = torch.load(os.path.join(best_logdir, "graphsage.pth"))
best_config = analysis.get_best_config('mean_accuracy', 'max')  # Get best trial's hyperparameters

best_graphsage = GraphSAGE(dataset.num_features, best_config['num_hidden'], dataset.num_classes, best_config['optimizer'])
best_graphsage.load_state_dict(state_dict)

This fails with the error:

FileNotFoundError: [Errno 2] No such file or directory: '/root/ray_results/trainKarateClub_2022-04-24_03-53-43/trainKarateClub_229b2_00132_132_S1=3,S2=5,batch_size=4,epochs=10,num_hidden=64,lr=0.1,weight_decay=0.0005_2022-04-24_04-01-04/graphsage.pth'

When I look, the folder does exist, and it contains the following files:

checkpoint_000000  checkpoint_000008
checkpoint_000001  checkpoint_000009
checkpoint_000002  events.out.tfevents.1650772864.90fd411bbef8
checkpoint_000003  params.json
checkpoint_000004  params.pkl
checkpoint_000005  progress.csv
checkpoint_000006  result.json
checkpoint_000007

Please help; I cannot load my model parameters and test the model without that .pth file.

Hello,
Could you share your trainKarateClub function?

import os

import torch
from ray import tune

def trainKarateClub(config, checkpoint_dir=None):

  # data setup
  dataset, data = getKarateClub()

  # train_loader
  hop_sizes = config['hop_sizes']
  batch_size = config['batch_size']

  train_loader = getTrainLoader(data, hop_sizes, batch_size)

  # model
  graphsage = GraphSAGE(dataset.num_features, config['num_hidden'], dataset.num_classes, config['optimizer'])

  # criterion, optimizer
  criterion = torch.nn.CrossEntropyLoss()
  optimizer = torch.optim.Adam(graphsage.parameters(), lr=config['optimizer']['lr'], weight_decay=config['optimizer']['weight_decay'])

  # The `checkpoint_dir` parameter gets passed by Ray Tune when a checkpoint
  # should be restored.
  if checkpoint_dir:
    checkpoint = os.path.join(checkpoint_dir, "checkpoint")
    model_state, optimizer_state = torch.load(checkpoint)
    graphsage.load_state_dict(model_state)
    optimizer.load_state_dict(optimizer_state)

  # train
  for epoch in range(config['epochs']):
    _ = graphsage.fitOneEpoch(data, optimizer, criterion)
    acc = test(graphsage, data)

    # Here we save a checkpoint. It is automatically registered with
    # Ray Tune and will potentially be passed as the `checkpoint_dir`
    # parameter in future iterations.
    with tune.checkpoint_dir(step=epoch) as checkpoint_dir:
      path = os.path.join(checkpoint_dir, "checkpoint")
      torch.save((graphsage.state_dict(), optimizer.state_dict()), path)

    # Send the current training result back to Tune
    tune.report(mean_accuracy=acc)

    # if i % 5 == 0:
    #   # This saves the model to the trial directory
    #   torch.save(graphsage.state_dict(), "./graphsage.pth")

Now that I’m looking at this, I wonder whether my problem comes from the last 3 lines, which are commented out for some reason?

Sorry, which 3 lines are you referring to?

Also, can you make sure that graphsage.pth is actually saved in the trainKarateClub function?
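
If you mean the commented-out torch.save at the end of the function, then yes: since it is commented out, graphsage.pth is never written. What your trainable does write are the checkpoint_000NNN directories, via the tune.checkpoint_dir block.

Here's an untested sketch of how you could load the weights from those checkpoints instead of from graphsage.pth. It assumes the layout your trainable writes (a file named checkpoint inside each checkpoint_000NNN directory, containing a (model_state, optimizer_state) tuple) and reuses your GraphSAGE and dataset objects:

import os

import torch

best_logdir = analysis.get_best_logdir('mean_accuracy', 'max')

# The trainable saved via tune.checkpoint_dir, so the weights live in
# <trial_logdir>/checkpoint_000NNN/checkpoint, not in graphsage.pth.
# This simply picks the latest checkpoint of the best trial; you may
# instead want the checkpoint from the best-scoring iteration.
checkpoint_dirs = sorted(
    d for d in os.listdir(best_logdir) if d.startswith("checkpoint_")
)
checkpoint_path = os.path.join(best_logdir, checkpoint_dirs[-1], "checkpoint")

# torch.save was given a (model_state, optimizer_state) tuple in the trainable.
model_state, optimizer_state = torch.load(checkpoint_path)

best_config = analysis.get_best_config('mean_accuracy', 'max')
best_graphsage = GraphSAGE(dataset.num_features, best_config['num_hidden'],
                           dataset.num_classes, best_config['optimizer'])
best_graphsage.load_state_dict(model_state)

Alternatively, if you do want a graphsage.pth, uncomment those last 3 lines and fix the loop variable (your loop uses epoch, not i). As far as I remember, Tune sets each trial's working directory to its logdir, so "./graphsage.pth" should then end up in the trial folder, but saving it inside the tune.checkpoint_dir context (or under an absolute path) would be safer.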