Where are results saved during Ray Tune experiments?

BaamPark · April 15, 2023, 6:05am

I am new to Ray Tune. I was trying out its basic functionality with the CIFAR-10 dataset, following a tutorial.

Here is what I observed:

1. With datasets.CIFAR10(root=data_dir, train=True, download=True), I get an error when download is set to False.
2. After running the code, I cannot find the directory where the CIFAR-10 dataset is saved.
3. torch.save("model.pt") doesn't actually save anything.

Given these observations, it seems that none of the byproducts generated while tuning over the search space are saved to the local disk, and that each experiment has its own independent space. I think the results are saved somewhere in memory, but I am not sure.

If you are familiar with Ray Tune, please let me know how files are saved.

justinvyu · April 17, 2023, 4:41pm

Hi @BaamPark,

Thanks for making this detailed post!

Q1: Where are the results logged to?

The default logging directory is ~/ray_results/ on the local filesystem. This can be changed through ray.air.RunConfig(storage_path), passed into tune.Tuner. See Tune Execution (tune.Tuner) — Ray 2.3.1 and ray.air.RunConfig — Ray 2.3.1.

By default, the experiment directory name is uniquely generated with a timestamp, but it can also be set manually (to something meaningful) with RunConfig(name).
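A quick way to check where results ended up is to list the experiment folders under the results directory. The following is a minimal sketch using only the standard library. `list_experiments` is a hypothetical helper (not part of Ray), and it assumes the default ~/ray_results location mentioned above unless another directory is passed in.

```python
from pathlib import Path


def list_experiments(results_dir="~/ray_results"):
    """Return experiment directory names under results_dir, newest first.

    Hypothetical helper, not a Ray API. Returns an empty list if the
    directory does not exist (for example, before any Tune run).
    """
    root = Path(results_dir).expanduser()
    if not root.is_dir():
        return []
    # Keep only sub-directories; each one is assumed to be an experiment.
    experiments = [p for p in root.iterdir() if p.is_dir()]
    # Sort by modification time so the most recent run comes first.
    experiments.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return [p.name for p in experiments]


if __name__ == "__main__":
    for name in list_experiments():
        print(name)
```

If a custom results directory was configured, pass that path instead of relying on the default.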

Within the actual Tune run, we download the dataset to "./data". This is a relative path, which is resolved relative to the trial directory. This means that each trial (e.g., when you are performing a search using Tune search spaces) will download the data separately.

If you want to use a single download folder across all trials instead, you can set data_dir="/some/shared/directory".
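To illustrate why a relative data_dir ends up in a different place for each trial while an absolute one is shared, here is a small standard-library sketch. It does not use Ray: the trial directories are simulated with temporary folders, and `resolve_data_dir` is a hypothetical helper that mimics how a path is resolved against a working directory.

```python
import tempfile
from pathlib import Path


def resolve_data_dir(data_dir, working_dir):
    """Return the absolute location that data_dir refers to from working_dir.

    Joining with an absolute data_dir discards working_dir, which mirrors
    how paths are resolved against the current working directory.
    """
    return (Path(working_dir) / data_dir).resolve()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        # Simulated trial directories (assumption: one working dir per trial).
        trial_a = Path(root) / "trial_a"
        trial_b = Path(root) / "trial_b"
        shared = Path(root) / "shared_data"
        for d in (trial_a, trial_b, shared):
            d.mkdir()

        # Relative path: a separate location per trial.
        print(resolve_data_dir("./data", trial_a))
        print(resolve_data_dir("./data", trial_b))

        # Absolute path: the same location regardless of the trial.
        print(resolve_data_dir(str(shared), trial_a))
        print(resolve_data_dir(str(shared), trial_b))
```

The first two printed paths differ, while the last two are identical, which is the behavior described above.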

Q2: Where is the checkpoint?

You also mentioned that torch.save is happening, but no ckpt.pt file is found. This is because we’re reporting the checkpoint file to Tune directly in Tune’s checkpointing logic:

# From https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html

torch.save(
    (net.state_dict(), optimizer.state_dict()), "my_model/checkpoint.pt"
)
checkpoint = Checkpoint.from_directory("my_model")
session.report(
    {"loss": (val_loss / val_steps), "accuracy": correct / total},
    checkpoint=checkpoint,
)

Notice that we pass the checkpoint directory my_model that contains the .pt file to Tune as a ray.air.Checkpoint object. This will read in the checkpoint data and package it in a "checkpoint_0000x" folder within the trial directory.
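To locate those packaged checkpoint folders on disk afterwards, something like the following sketch may help. It uses only the standard library. `find_checkpoints` is a hypothetical helper (not a Ray API), and it assumes the folder names start with "checkpoint_" followed by a zero-padded index, as described above, so that sorting by name also sorts by checkpoint order.

```python
from pathlib import Path


def find_checkpoints(trial_dir):
    """Return checkpoint folders inside trial_dir, sorted by name.

    Assumes folders are named "checkpoint_<zero-padded index>".
    Regular files that happen to match the pattern are ignored.
    """
    root = Path(trial_dir)
    return sorted(p for p in root.glob("checkpoint_*") if p.is_dir())


if __name__ == "__main__":
    import sys

    # Usage: python find_checkpoints.py /path/to/trial_directory
    trial_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for checkpoint_dir in find_checkpoints(trial_dir):
        print(checkpoint_dir)
```

The last entry in the returned list would be the most recent checkpoint under that naming assumption.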

You can access the contents after the run with these two methods:

Example:

import os

from ray.air import Checkpoint

checkpoint: Checkpoint = Checkpoint.from_directory("/path/to/checkpoint_folder")
with checkpoint.as_directory() as d:
    # The .pt file is binary, so open it in binary mode.
    with open(os.path.join(d, "checkpoint.pt"), "rb") as f:
        # Use the checkpoint, e.g. torch.load(f)
        pass

See Ray AIR Checkpoint — Ray 2.3.1 for more details.

More resources

The Tune user guides (User Guides — Ray 3.0.0.dev0) may be useful to read through. In particular:

How to Save and Load Trial Checkpoints — Ray 3.0.0.dev0

These defaults are definitely not obvious, so thanks for creating this question. I'll create a GitHub issue to add some of these explanations to our user guides/documentation. Let me know if this works for you, and if you have any other questions.

BaamPark · April 18, 2023, 2:11pm

Thank you so much for your reply! This helps me a lot.
