Display trial scores

How can I print just the final table with all trials and their scores, without the dataset-download messages or the checkpoint logs?

I have 270 trials and I see output like this:
2024-08-26 12:57:59,499	INFO worker.py:1781 -- Started a local Ray instance.
+-------------------------------------------------------------------+
| Configuration for experiment     TrainMNIST_2024-08-26_12-58-02   |
+-------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator            |
| Scheduler                        AsyncHyperBandScheduler          |
| Number of trials                 270                              |
+-------------------------------------------------------------------+

View detailed results here: /root/ray_results/TrainMNIST_2024-08-26_12-58-02
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-08-26_12-57-53_513810_235/artifacts/2024-08-26_12-58-02/TrainMNIST_2024-08-26_12-58-02/driver_artifacts`

Trial status: 200 PENDING
Current time: 2024-08-26 12:58:07. Total running time: 3s
Logical resource usage: 0/2 CPUs, 0/0 GPUs
+---------------------------------------------------------+
| Trial name               status       batch_size     lr |
+---------------------------------------------------------+
| TrainMNIST_d4276_00000   PENDING              16      1 |
| TrainMNIST_d4276_00001   PENDING              32      1 |
| TrainMNIST_d4276_00002   PENDING              64      1 |
| TrainMNIST_d4276_00003   PENDING             128      1 |
| TrainMNIST_d4276_00004   PENDING             256      1 |
+---------------------------------------------------------+
195 more PENDING
(TrainMNIST pid=850) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
(TrainMNIST pid=850) Failed to download (trying next):
(TrainMNIST pid=850) HTTP Error 403: Forbidden
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /root/data/MNIST/raw/train-images-idx3-ubyte.gz
  0%|          | 0/9912422 [00:00<?, ?it/s]
  1%|          | 65536/9912422 [00:00<00:18, 525697.78it/s]
  2%|▏         | 196608/9912422 [00:00<00:11, 820402.81it/s]
  7%|▋         | 688128/9912422 [00:00<00:04, 2207103.31it/s]
 23%|██▎       | 2260992/9912422 [00:00<00:01, 6200564.23it/s]
100%|██████████| 9912422/9912422 [00:00<00:00, 15311955.73it/s]
(TrainMNIST pid=850) Extracting /root/data/MNIST/raw/train-images-idx3-ubyte.gz to /root/data/MNIST/raw
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
(TrainMNIST pid=850) Failed to download (trying next):
(TrainMNIST pid=850) HTTP Error 403: Forbidden
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /root/data/MNIST/raw/train-labels-idx1-ubyte.gz
  0%|          | 0/28881 [00:00<?, ?it/s]
100%|██████████| 28881/28881 [00:00<00:00, 450953.92it/s]
(TrainMNIST pid=850) Extracting /root/data/MNIST/raw/train-labels-idx1-ubyte.gz to /root/data/MNIST/raw
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
(TrainMNIST pid=850) Failed to download (trying next):
(TrainMNIST pid=850) HTTP Error 403: Forbidden
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /root/data/MNIST/raw/t10k-images-idx3-ubyte.gz
  0%|          | 0/1648877 [00:00<?, ?it/s]
  4%|▍         | 65536/1648877 [00:00<00:03, 516045.65it/s]
 18%|█▊        | 294912/1648877 [00:00<00:01, 1270899.86it/s]
100%|██████████| 1648877/1648877 [00:00<00:00, 4290494.75it/s]
(TrainMNIST pid=850) Extracting /root/data/MNIST/raw/t10k-images-idx3-ubyte.gz to /root/data/MNIST/raw
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
(TrainMNIST pid=850) Failed to download (trying next):
(TrainMNIST pid=850) HTTP Error 403: Forbidden
(TrainMNIST pid=850) 
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
(TrainMNIST pid=850) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /root/data/MNIST/raw/t10k-labels-idx1-ubyte.gz
(TrainMNIST pid=850) Extracting /root/data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /root/data/MNIST/raw
(TrainMNIST pid=850) 

Trial TrainMNIST_d4276_00000 started with configuration:
+--------------------------------------------+
| Trial TrainMNIST_d4276_00000 config        |
+--------------------------------------------+
| batch_size                              16 |
| lr                                       1 |
+--------------------------------------------+
100%|██████████| 4542/4542 [00:00<00:00, 11206193.39it/s]
(TrainMNIST pid=850) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/root/ray_results/TrainMNIST_2024-08-26_12-58-02/TrainMNIST_d4276_00000_0_batch_size=16,lr=1_2024-08-26_12-58-05/checkpoint_000000)

Trial TrainMNIST_d4276_00000 completed after 5 iterations at 2024-08-26 12:58:31. Total running time: 27s

You need to customize the implementation of the TuneReporterBase class, and you can delete the information you don't want to display.
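
If writing a custom reporter feels heavy, a lighter option to try first is to stop forwarding worker output to the driver and hand Tune a trimmed reporter. This is only a sketch: CLIReporter and log_to_driver are standard Ray APIs, but whether RunConfig accepts progress_reporter and verbose depends on the Ray version, and the column names below assume the metrics reported by the script further down.

import ray
from ray import tune
from ray.train import RunConfig

# Do not forward worker stdout/stderr to the driver console. The download
# messages, progress bars and the "Checkpoint successfully created at: ..."
# line all carry the (TrainMNIST pid=...) prefix, i.e. they are worker
# output, so this should hide them.
ray.init(log_to_driver=False)

# Only show the columns you care about in the trial table. For a nested
# TorchTrainer config the parameter columns may need a
# "train_loop_config/..." prefix.
reporter = tune.CLIReporter(
    parameter_columns=["batch_size", "lr"],
    metric_columns=["loss", "accuracy", "training_iteration"],
)

run_config = RunConfig(
    progress_reporter=reporter,  # version-dependent: check your RunConfig signature
    verbose=1,                   # lower verbosity prints fewer per-trial results
)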

I want to create a tune.Trainable class for train_func_per_worker, but I have never seen an example; my current function-based code is below.
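
For reference, here is a minimal, untested sketch of what such a class could look like (the function-based script being adapted follows after the sketch). setup/step/save_checkpoint/load_checkpoint are the standard tune.Trainable hooks, though their exact signatures vary slightly across Ray versions; everything inside them, including the class name TrainCIFAR, is illustrative. Note that a plain Trainable runs in a single process, so it replaces the TorchTrainer + train_func_per_worker combination rather than wrapping it; the distributed helpers (prepare_data_loader, get_world_size) have no equivalent here.

import os

import torch
from ray import tune
from torch import nn
from torch.optim.adamw import AdamW
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from torchvision.transforms import v2


class TrainCIFAR(tune.Trainable):  # hypothetical name, purely illustrative
    def setup(self, config):
        # Called once per trial: build data, model and optimizer from the config.
        transform = v2.Compose([
            v2.ToImage(),
            v2.ToDtype(torch.float32, scale=True),
            v2.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
        ])
        data_dir = os.path.expanduser(config.get("data_dir", "~/data"))
        train_ds = CIFAR10(root=data_dir, train=True, download=True, transform=transform)
        val_ds = CIFAR10(root=data_dir, train=False, download=True, transform=transform)
        self.train_loader = DataLoader(train_ds, batch_size=config["batch_size"], shuffle=True)
        self.val_loader = DataLoader(val_ds, batch_size=config["batch_size"])
        self.model = NeuralNetwork()  # the model class defined in the script below
        self.loss_fn = nn.CrossEntropyLoss()
        self.optimizer = AdamW(self.model.parameters(), lr=config["lr"])

    def step(self):
        # One training_iteration == one epoch; the returned dict is the
        # per-iteration result that ends up in the trial table.
        self.model.train()
        for X, y in self.train_loader:
            self.optimizer.zero_grad()
            loss = self.loss_fn(self.model(X), y)
            loss.backward()
            self.optimizer.step()

        self.model.eval()
        total_loss, correct = 0.0, 0
        with torch.inference_mode():
            for X, y in self.val_loader:
                output = self.model(X)
                total_loss += self.loss_fn(output, y).item()
                correct += (output.argmax(1) == y).sum().item()
        return {
            "loss": total_loss / len(self.val_loader),
            "accuracy": correct / len(self.val_loader.dataset),
        }

    def save_checkpoint(self, checkpoint_dir):
        torch.save(
            {"model": self.model.state_dict(),
             "optimizer": self.optimizer.state_dict()},
            os.path.join(checkpoint_dir, "data.ckpt"),
        )

    def load_checkpoint(self, checkpoint_dir):
        state = torch.load(os.path.join(checkpoint_dir, "data.ckpt"))
        self.model.load_state_dict(state["model"])
        self.optimizer.load_state_dict(state["optimizer"])

A class like this would be passed directly as the trainable, e.g. Tuner(TrainCIFAR, param_space={"lr": ..., "batch_size": ..., "data_dir": ...}), with the scheduler set in TuneConfig as in the script below.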

import argparse
import os
import tempfile
from typing import Dict

import ray
import ray.cloudpickle as cpickle
import ray.train
import torch
import torch.nn as nn
from filelock import FileLock
from ray import tune
from ray.train import Checkpoint, FailureConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.schedulers import PopulationBasedTraining
from ray.tune.tune_config import TuneConfig
from ray.tune.tuner import Tuner
from torch.optim.adamw import AdamW
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from torchvision.transforms import v2


def get_dataloaders(config):

    transform = v2.Compose(
        [
            v2.ToImage(),
            v2.ToDtype(torch.float32, scale=True),
            v2.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2023, 0.1994, 0.2010)),
        ]
    )

    # Expand "~" so the default --data-dir of "~/data" resolves to a real path.
    data_dir = os.path.expanduser(config.get("data_dir"))
    os.makedirs(data_dir, exist_ok=True)
    with FileLock(os.path.join(data_dir, ".ray.lock")):
        train_dataset = CIFAR10(
            root=data_dir, train=True, download=True, transform=transform
        )
        validation_dataset = CIFAR10(
            root=data_dir, train=False, download=False, transform=transform
        )

    worker_batch_size = config["batch_size"] // ray.train.get_context().get_world_size()

    train_loader = DataLoader(
        train_dataset, batch_size=worker_batch_size, shuffle=True)
    validation_loader = DataLoader(
        validation_dataset, batch_size=worker_batch_size)

    train_loader = ray.train.torch.prepare_data_loader(train_loader)
    validation_loader = ray.train.torch.prepare_data_loader(validation_loader)

    return train_loader, validation_loader


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(3 * 32 * 32, 512),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Dropout(0.25),
            nn.Linear(512, 10),
            # No activation here: CrossEntropyLoss expects raw logits.
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train_epoch(epoch, dataloader, model, loss_fn, optimizer):
    if ray.train.get_context().get_world_size() > 1:
        dataloader.sampler.set_epoch(epoch)

    size = len(dataloader.dataset) // ray.train.get_context().get_world_size()
    num_batches = len(dataloader)
    train_loss, train_acc = 0, 0
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        output = model(X)
        loss = loss_fn(output, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()
        y_pred = output.argmax(1, keepdim=True)
        train_acc += y_pred.eq(y.view_as(y_pred)).sum().item()

    train_loss /= num_batches
    train_acc /= size
    return {"Train Loss": train_loss, "Train Accuracy": train_acc}


@torch.inference_mode()
def validate_epoch(dataloader, model, loss_fn):
    size = len(dataloader.dataset) // ray.train.get_context().get_world_size()
    num_batches = len(dataloader)
    test_loss, test_acc = 0, 0
    model.eval()
    for X, y in dataloader:
        output = model(X)
        loss = loss_fn(output, y)

        test_loss += loss.item()
        y_pred = output.argmax(1, keepdim=True)
        test_acc += y_pred.eq(y.view_as(y_pred)).sum().item()

    test_loss /= num_batches
    test_acc /= size
    return {"Test Loss": test_loss, "Test Accuracy": test_acc}


def update_optimizer_config(optimizer, config):
    for param_group in optimizer.param_groups:
        for param, val in config.items():
            param_group[param] = val


def train_func_per_worker(config: Dict):
    lr = config.get("lr")
    epochs = config.get("epochs")

    train_dataloader, test_dataloader = get_dataloaders(config)

    model = NeuralNetwork()
    if not ray.train.get_checkpoint():
        model = ray.train.torch.prepare_model(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer_config = {
        "lr": lr,
    }
    optimizer = AdamW(model.parameters(), **optimizer_config)

    starting_epoch = 1
    if ray.train.get_checkpoint():
        with ray.train.get_checkpoint().as_directory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "data.ckpt"), "rb") as fp:
                checkpoint_dict = cpickle.load(fp)

        # Load in model
        model_state = checkpoint_dict["model"]
        model.load_state_dict(model_state)
        model = ray.train.torch.prepare_model(model)

        # Load in optimizer
        optimizer_state = checkpoint_dict["optimizer_state_dict"]
        optimizer.load_state_dict(optimizer_state)

        update_optimizer_config(optimizer, optimizer_config)

        # The current epoch increments the loaded epoch by 1
        checkpoint_epoch = checkpoint_dict["epoch"]
        starting_epoch = checkpoint_epoch + 1

    for epoch in range(starting_epoch, epochs + 1):

        train_result = train_epoch(
            epoch, train_dataloader, model, loss_fn, optimizer)
        test_result = validate_epoch(test_dataloader, model, loss_fn)

        # Report once per epoch, attaching the checkpoint to the same call.
        # Reporting twice would advance training_iteration twice per epoch,
        # and the second report would be missing the "loss" metric that the
        # Tuner and the PBT scheduler are configured to use.
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "data.ckpt"), "wb") as fp:
                cpickle.dump(
                    {
                        "epoch": epoch,
                        # prepare_model may or may not wrap the model in DDP,
                        # so unwrap only if the wrapper is present.
                        "model": (model.module if hasattr(model, "module")
                                  else model).state_dict(),
                        "optimizer_state_dict": optimizer.state_dict(),
                    },
                    fp,
                )
            checkpoint = Checkpoint.from_directory(checkpoint_dir)
            ray.train.report(
                metrics={
                    "loss": test_result["Test Loss"],
                    "accuracy": test_result["Test Accuracy"],
                },
                checkpoint=checkpoint,
            )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--address", required=False, type=str,
                        help="The address to use for Redis.")
    parser.add_argument(
        "--num-workers",
        "-n",
        type=int,
        default=os.cpu_count(),
        help="Sets number of workers for training.",
    )
    parser.add_argument(
        "--num-epochs", type=int, default=2, help="Number of epochs to train."
    )

    parser.add_argument(
        "--num-samples", type=int, default=2, help="Number of samples to run."
    )

    parser.add_argument(
        "--use-gpu", action="store_true", default=False, help="Enables GPU training."
    )
    parser.add_argument(
        "--data-dir",
        required=False,
        type=str,
        default="~/data",
        help="Root directory for storing downloaded dataset.",
    )
    parser.add_argument(
        "--synch", action="store_true", default=False, help="Use synchronous PBT."
    )

    args, _ = parser.parse_known_args()

    # ray.init(ignore_reinit_error=True, log_to_driver=False)
    ray.init(address=args.address, ignore_reinit_error=True)

    param_space = {
        "train_loop_config": {
            "lr": tune.grid_search([0.001, 0.01, 0.05, 0.1]),
            "batch_size": args.num_workers,
            "data_dir": args.data_dir,
            "epochs": args.num_epochs,
        }
    }

    train_loop_config = {
        "lr": 1e-3,
        "epochs": 2,
        "batch_size": 32,
        "data_dir": args.data_dir,
    }

    scaling_config = ScalingConfig(
        num_workers=args.num_workers, use_gpu=args.use_gpu)

    scheduler = PopulationBasedTraining(
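        # PBT perturbs or resamples the entries in `hyperparam_mutations` every
        # `perturbation_interval` units of `time_attr` (here: after every
        # reported training iteration), cloning weights from better-performing
        # trials into worse-performing ones.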
        time_attr="training_iteration",
        perturbation_interval=1,
        hyperparam_mutations={
            "train_loop_config": {
                "lr": tune.loguniform(0.001, 0.1),
            }
        },
        synch=args.synch,
    )

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config=train_loop_config,
        scaling_config=scaling_config,
    )

    tuner = Tuner(
        trainable=trainer,
        param_space=param_space,
        tune_config=TuneConfig(
            num_samples=args.num_samples, metric="loss", mode="min", scheduler=scheduler, reuse_actors=True
        ),
        run_config=RunConfig(
            stop={"training_iteration": args.num_epochs},
            failure_config=FailureConfig(max_failures=3),
        ),
    )

    results = tuner.fit()

    print(results.get_best_result(metric="loss", mode="min"))
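
    # To print a final table with every trial and its score, independent of the
    # console reporter, dump the ResultGrid as a pandas DataFrame: one row per
    # trial, containing the last reported metrics plus the config columns.
    # Exact column names (e.g. "config/train_loop_config/lr") depend on the
    # Ray version and on what was reported, so inspect df.columns first.
    df = results.get_dataframe()
    print(df.to_string())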