Where are results saved during Ray Tune experiments?

I am new to Ray Tune. I was trying out its basic functionality with the CIFAR-10 dataset, following a tutorial.

Here is what I observed:

  • In “datasets.CIFAR10(root=data_dir, train=True, download=True)”, I get an error when download is set to False
  • After running the code, I cannot find the directory where the CIFAR-10 dataset is saved
  • torch.save(“model.pt”) doesn’t actually save

Given these observations, it seems that the byproducts generated while tuning over the search space are not saved to local disk, and that each experiment has its own independent space. I think the results are saved somewhere in memory, but I am not sure.

If you are familiar with Ray Tune, please let me know how files are saved.

Hi @BaamPark,

Thanks for making this detailed post!

Q1: Where are the results logged to?

The default logging directory is ~/ray_results/ on the local filesystem. This can be changed through ray.air.RunConfig(storage_path), passed to tune.Tuner. See Tune Execution (tune.Tuner) — Ray 2.3.1 and ray.air.RunConfig — Ray 2.3.1.

By default, the experiment directory name is uniquely generated with a timestamp, but it can also be set manually (to something meaningful) with RunConfig(name).

Within the actual Tune run, we download the dataset to "./data". This is a relative path, which is resolved relative to the trial directory. This means that each trial (e.g., when you are performing a search using Tune search spaces) will download the data separately.

If you want to use a single download folder across all trials instead, you can set data_dir="/some/shared/directory".
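To illustrate why a relative path ends up inside each trial directory, here is a minimal standard-library sketch. It assumes, as described above, that each trial resolves relative paths against its own trial directory. The helper name resolve_data_dir and the directory names are hypothetical and not part of Ray.

```python
import os
import tempfile


def resolve_data_dir(data_dir: str, working_dir: str) -> str:
    """Return the location data_dir refers to when paths are
    resolved against working_dir (hypothetical helper)."""
    # os.path.join discards working_dir if data_dir is already absolute.
    return os.path.normpath(os.path.join(working_dir, data_dir))


# Two made-up trial directories standing in for Tune's per-trial folders.
root = tempfile.mkdtemp()
trial_a = os.path.join(root, "trial_a")
trial_b = os.path.join(root, "trial_b")

# A relative path resolves differently per trial: one download each.
relative = [resolve_data_dir("./data", t) for t in (trial_a, trial_b)]

# An absolute path resolves identically everywhere: one shared download.
shared_dir = os.path.join(root, "shared_data")
absolute = [resolve_data_dir(shared_dir, t) for t in (trial_a, trial_b)]

print(relative)
print(absolute)
```

The two entries in relative differ, while the two entries in absolute are the same path, which is why an absolute data_dir lets all trials share one copy of the dataset.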

Q2: Where is the checkpoint?

You also mentioned that torch.save is being called, but no saved .pt file is found. This is because we’re reporting the checkpoint file to Tune directly in Tune’s checkpointing logic:

# From https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html

torch.save(
    (net.state_dict(), optimizer.state_dict()), "my_model/checkpoint.pt"
)
checkpoint = Checkpoint.from_directory("my_model")
session.report(
    {"loss": (val_loss / val_steps), "accuracy": correct / total},
    checkpoint=checkpoint,
)

Notice that we pass the checkpoint directory my_model that contains the .pt file to Tune as a ray.air.Checkpoint object. This will read in the checkpoint data and package it in a "checkpoint_0000x" folder within the trial directory.
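If you want to locate those folders on disk after a run, a small standard-library sketch like the following may help. It assumes the layout described above (checkpoint folders whose names start with "checkpoint_" somewhere under the experiment directory). The function name find_checkpoint_dirs and the demo directory names are hypothetical.

```python
import os
import tempfile
from typing import List


def find_checkpoint_dirs(experiment_dir: str) -> List[str]:
    """Return sorted paths of all 'checkpoint_*' folders under experiment_dir."""
    found = []
    for current, dirnames, _ in os.walk(experiment_dir):
        for name in dirnames:
            if name.startswith("checkpoint_"):
                found.append(os.path.join(current, name))
    return sorted(found)


# Demo on a throwaway layout that mimics the assumed structure:
# <experiment>/<trial>/checkpoint_00000x
demo_root = tempfile.mkdtemp()
for trial, ckpt in [
    ("trial_a", "checkpoint_000000"),
    ("trial_a", "checkpoint_000001"),
    ("trial_b", "checkpoint_000000"),
]:
    os.makedirs(os.path.join(demo_root, trial, ckpt))

checkpoints = find_checkpoint_dirs(demo_root)
for path in checkpoints:
    print(path)
```

For a real run you would pass your own experiment directory instead of the demo layout, for example a folder under os.path.expanduser("~/ray_results").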

You can access the contents after the run with these two methods:

  1. ray.air.checkpoint.Checkpoint.from_directory — Ray 2.3.1
  2. ray.air.checkpoint.Checkpoint.as_directory — Ray 2.3.1

Example:

import os

from ray.air import Checkpoint

checkpoint: Checkpoint = Checkpoint.from_directory("/path/to/checkpoint_folder")
with checkpoint.as_directory() as d:
    # The .pt file is binary, so open it in "rb" mode.
    with open(os.path.join(d, "checkpoint.pt"), "rb") as f:
        # Use the checkpoint, e.g. pass f to torch.load
        pass

See Ray AIR Checkpoint — Ray 2.3.1 for more details.

More resources

The Tune user guides (User Guides — Ray 3.0.0.dev0) may be useful to read through. In particular:

These defaults are definitely not obvious, so thanks for creating this question. I’ll create a GitHub issue to add some of these explanations to our user guides/documentation. Let me know if these work for you, and if you have any other questions.

Thank you so much for your reply! This helps me a lot.