Ray Tune on GCP cluster: checkpoint not found after successful sync down

TL;DR: Ray Tune on GCP cluster fails to sync down last checkpoint

Hi,

I am trying the Ray Tune mnist_pytorch example (mnist_pytorch — Ray v2.0.0.dev0) on a GCP cluster,
and I have added checkpointing to my training loop using tune.checkpoint_dir.
However, I keep encountering the error below, which reports a failure to sync the last iteration's checkpoint from the worker to the head node:

2021-04-06 09:51:12,876	ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 186, in sync_down
    result = self.sync_client.sync_down(self._remote_path,
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/sync_client.py", line 212, in sync_down
    return self._execute(self.sync_down_template, source, target)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/sync_client.py", line 271, in _execute
    stdout=self._get_logfile())
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/sync_client.py", line 201, in _get_logfile
    raise RuntimeError(
RuntimeError: [internalerror] The client has been closed. Please report this stacktrace + your cluster configuration on Github!
2021-04-06 09:51:12,877	ERROR syncer.py:413 -- Trial train_mnist_75714_00001: Checkpoint sync skipped. This should not happen.
2021-04-06 09:51:12,878	ERROR trial_runner.py:899 -- Trial train_mnist_75714_00001: Error handling checkpoint /home/ubuntu/ray_results/exp/train_mnist_75714_00001_1_lr=0.00026945,momentum=0.36098_2021-04-06_09-50-03/checkpoint_000100/
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 891, in _process_trial_save
    self._callbacks.on_checkpoint(
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/callback.py", line 216, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/home/ubuntu/anaconda3/lib/python3.8/site-packages/ray/tune/syncer.py", line 428, in _sync_trial_checkpoint
    raise TuneError("Trial {}: Checkpoint path {} not "
ray.tune.error.TuneError: Trial train_mnist_75714_00001: Checkpoint path /home/ubuntu/ray_results/exp/train_mnist_75714_00001_1_lr=0.00026945,momentum=0.36098_2021-04-06_09-50-03/checkpoint_000100/ not found after successful sync down.

As a result, the last checkpoint at the end of training is missing.
I have run into the same issue with other examples I have developed, and the same error also occurs when a trial is stopped early, before reaching the end of training (e.g. when Tune interrupts it because its performance is worse than that of other trials).

Could you help me understand why this happens and how to prevent it, please?

Here is the script mnist_pytorch.py I am running:

import os
import argparse
from filelock import FileLock
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler

# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256


class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


def train(model, optimizer, train_loader, device=None):
    device = device or torch.device("cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader, device=None):
    device = device or torch.device("cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    return correct / total


def get_data_loaders():
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/data.lock")):
        train_loader = torch.utils.data.DataLoader(
            datasets.KMNIST(
                "~/data",
                train=True,
                download=True,
                transform=mnist_transforms),
            batch_size=64,
            shuffle=True)
    test_loader = torch.utils.data.DataLoader(
        datasets.KMNIST("~/data", train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    return train_loader, test_loader


def train_mnist(config, checkpoint_dir=None, save_every_n=2):
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    train_loader, test_loader = get_data_loaders()
    model = ConvNet().to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])

    if checkpoint_dir:
        checkpoint_file = os.path.join(checkpoint_dir, "checkpoint.pth")
        model_state, optimizer_state = torch.load(checkpoint_file)
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)

    step = 0
    while True:
        train(model, optimizer, train_loader, device)
        acc = test(model, test_loader, device)

        step += 1
        if step % save_every_n == 0:
            with tune.checkpoint_dir(step=step) as checkpoint_dir:
                path = os.path.join(checkpoint_dir, "checkpoint.pth")
                torch.save((model.state_dict(), optimizer.state_dict()), path)

        # Set this to run Tune.
        tune.report(mean_accuracy=acc)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument(
        "--cuda",
        action="store_true",
        default=False,
        help="Enables GPU training")
    parser.add_argument(
        "--smoke-test", action="store_true", help="Finish quickly for testing")
    args = parser.parse_args()

    try:
        ray.init(address="auto")
    except:
        ray.init()

    # for early stopping
    sched = AsyncHyperBandScheduler()

    analysis = tune.run(
        train_mnist,
        metric="mean_accuracy",
        mode="max",
        name="exp",
        scheduler=sched,
        stop={
            "mean_accuracy": 0.98,
            "training_iteration": 5 if args.smoke_test else 100
        },
        resources_per_trial={
            "cpu": 4,
            "gpu": int(args.cuda)  # set this for GPUs
        },
        num_samples=1 if args.smoke_test else 20,
        config={
            "lr": tune.loguniform(1e-4, 1e-2),
            "momentum": tune.uniform(0.1, 0.9),
        })

    print("Best config is:", analysis.best_config)

(This is the same script as in mnist_pytorch — Ray v2.0.0.dev0, except for the addition of the checkpointing, the use of the KMNIST dataset instead of MNIST, and the resources required per trial - no GPU is used.)

Below is the YAML file used to configure the GCP cluster:

# A unique identifier for the head node and workers of this cluster.
cluster_name: clustername

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 1.0

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker: {}

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
   type: gcp
   region: europe-west1
   availability_zone: europe-west1-b
   project_id: <my_project>

# How Ray will authenticate with newly launched nodes.
auth:
   ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
head_node:
   machineType: n1-standard-2
   disks:
     - boot: true
       autoDelete: true
       type: PERSISTENT
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
         sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cu101-debian-10

   # Additional options can be found in the compute docs at
   # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

   # If the network interface is specified as below in both head and worker
   # nodes, the manual network config is used.  Otherwise an existing subnet is
   # used.  To use a shared subnet, ask the subnet owner to grant permission
   # for 'compute.subnetworks.use' to the ray autoscaler account...
   # networkInterfaces:
   #   - kind: compute#networkInterface
   #     subnetwork: path/to/subnet
   #     aliasIpRanges: []

worker_nodes:
   machineType: n1-standard-4
   disks:
     - boot: true
       autoDelete: true
       type: PERSISTENT
       initializeParams:
         diskSizeGb: 50
         # See https://cloud.google.com/compute/docs/images for more images
         sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cu101-debian-10
   scheduling:
     - preemptible: false
       onHostMaintenance: TERMINATE

   # Additional options can be found in the compute docs at
   # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
 "./mnist_pytorch.py": "./mnist_pytorch.py",
#    "/path1/on/remote/machine": "/path1/on/local/machine",
}

# Files or directories to copy from the head node to the worker nodes. The format is a
# list of paths. The same path on the head node will be copied to the worker node.
# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
# you should just use file_mounts. Only use this if you know what you're doing!
cluster_synced_files: []

# Whether changes to directories in file_mounts or cluster_synced_files in the head node
# should sync to the worker node continuously
file_mounts_sync_continuously: False

# Patterns for files to exclude when running rsync up or rsync down
rsync_exclude: []

# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for
# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided
# as a value, the behavior will match git's behavior for finding and using .gitignore files.
rsync_filter: []

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: []

# List of shell commands to run to set up nodes.
setup_commands:
   # Note: if you're developing Ray, you probably want to create an AMI that
   # has your Ray repo pre-cloned. Then, you can replace the pip installs
   # below with a git checkout <your_sha> (and possibly a recompile).
   # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

   # Install MiniConda.
   - >-
     sudo apt install -y build-essential
     && wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/anaconda3.sh
     || true
     && bash ~/anaconda3.sh -b -p ~/anaconda3 || true
     && rm ~/anaconda3.sh
     && echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.profile

   # Install ray
   - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
   - pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
   - pip install pandas ray[tune] ax-platform sqlalchemy scikit-optimize tensorboard tensorboardX

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
 - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
   - ray stop
   - >-
     ulimit -n 65536;
     ray start
     --head
     --port=6379
     --object-manager-port=8076
     --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
   - ray stop
   - >-
     ulimit -n 65536;
     ray start
     --address=$RAY_HEAD_IP:6379
     --object-manager-port=8076

Many thanks.

Hi @marta_hum, does this happen even without the scheduler?

Hi @rliaw, thanks for your reply. Yes: if I set the scheduler to None, the default FIFO scheduler is used, and I still get the same error message as soon as a trial reaches its maximum number of iterations.
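
For reference, this is the call I used for that test - the same tune.run as in the script above, just without passing a scheduler:

# Same tune.run call as in the script above, but without a scheduler,
# so Tune falls back to the default FIFOScheduler.
analysis = tune.run(
    train_mnist,
    metric="mean_accuracy",
    mode="max",
    name="exp",
    stop={
        "mean_accuracy": 0.98,
        "training_iteration": 100
    },
    resources_per_trial={
        "cpu": 4,
        "gpu": 0
    },
    num_samples=20,
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    })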

Is it possible that one of the nodes is being released/deprovisioned before syncing is finished?

I guess that could be the case. How could I test it?
However, it happens with every trial as soon as it reaches the maximum number of iterations, even with the simple MNIST example I posted above (which is essentially the one from the Ray docs). So perhaps something odd happens when the trials are signalled as terminated?

Do you have any suggestions on how I can make sure the last checkpoint is correctly synced, please?
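
For instance, would uploading checkpoints to cloud storage help? Below is a rough, untested sketch of what I have in mind, assuming tune.SyncConfig with an upload_dir is supported in this Ray version (gs://my-tune-checkpoints is a placeholder bucket name, not something I have actually set up):

from ray import tune

# Untested sketch: sync trial results and checkpoints to a GCS bucket so that
# they do not depend only on rsync between the worker and the head node.
# "gs://my-tune-checkpoints" is a placeholder bucket name.
sync_config = tune.SyncConfig(upload_dir="gs://my-tune-checkpoints")

analysis = tune.run(
    train_mnist,
    metric="mean_accuracy",
    mode="max",
    name="exp",
    num_samples=20,
    resources_per_trial={"cpu": 4},
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    },
    sync_config=sync_config)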

Are there any logs in ~/ray_results/train_mnist for this?

Yes, within each trial folder there is a log file (log_sync_out***.log); they all look like the one below:

receiving incremental file list
./
stderr
stdout
checkpoint_000002/
checkpoint_000002/.is_checkpoint
checkpoint_000002/.tune_metadata
checkpoint_000002/checkpoint.pth

sent 134 bytes  received 17,455 bytes  11,726.00 bytes/sec
total size is 22,450  speedup is 1.28
receiving incremental file list

sent 21 bytes  received 214 bytes  470.00 bytes/sec
total size is 22,450  speedup is 95.53
receiving incremental file list
./
checkpoint_000004/
checkpoint_000004/.is_checkpoint
checkpoint_000004/.tune_metadata
checkpoint_000004/checkpoint.pth

sent 93 bytes  received 16,478 bytes  33,142.00 bytes/sec
total size is 41,193  speedup is 2.49
receiving incremental file list
./
checkpoint_000006/
checkpoint_000006/.is_checkpoint
checkpoint_000006/.tune_metadata
checkpoint_000006/checkpoint.pth

sent 94 bytes  received 16,609 bytes  33,406.00 bytes/sec
total size is 59,936  speedup is 3.59
receiving incremental file list
./
checkpoint_000008/
checkpoint_000008/.is_checkpoint
checkpoint_000008/.tune_metadata
checkpoint_000008/checkpoint.pth

sent 95 bytes  received 16,699 bytes  33,588.00 bytes/sec
total size is 78,679  speedup is 4.68
receiving incremental file list
./
checkpoint_000010/
checkpoint_000010/.is_checkpoint
checkpoint_000010/.tune_metadata
checkpoint_000010/checkpoint.pth

sent 96 bytes  received 16,798 bytes  33,788.00 bytes/sec
total size is 97,422  speedup is 5.77
receiving incremental file list
./
checkpoint_000012/
checkpoint_000012/.is_checkpoint
checkpoint_000012/.tune_metadata
checkpoint_000012/checkpoint.pth

sent 97 bytes  received 16,958 bytes  11,370.00 bytes/sec
total size is 116,165  speedup is 6.81
receiving incremental file list
./
checkpoint_000014/
checkpoint_000014/.is_checkpoint
checkpoint_000014/.tune_metadata
checkpoint_000014/checkpoint.pth

sent 98 bytes  received 17,041 bytes  11,426.00 bytes/sec
total size is 134,908  speedup is 7.87
receiving incremental file list
./
checkpoint_000016/
checkpoint_000016/.is_checkpoint
checkpoint_000016/.tune_metadata
checkpoint_000016/checkpoint.pth

sent 99 bytes  received 17,166 bytes  34,530.00 bytes/sec
total size is 153,651  speedup is 8.90
receiving incremental file list
./
stderr
checkpoint_000018/
checkpoint_000018/.is_checkpoint
checkpoint_000018/.tune_metadata
checkpoint_000018/checkpoint.pth

sent 147 bytes  received 17,506 bytes  35,306.00 bytes/sec
total size is 172,459  speedup is 9.77
receiving incremental file list
./
checkpoint_000020/
checkpoint_000020/.is_checkpoint
checkpoint_000020/.tune_metadata
checkpoint_000020/checkpoint.pth

sent 101 bytes  received 17,314 bytes  34,830.00 bytes/sec
total size is 191,202  speedup is 10.98
receiving incremental file list
./
checkpoint_000022/
checkpoint_000022/.is_checkpoint
checkpoint_000022/.tune_metadata
checkpoint_000022/checkpoint.pth

sent 102 bytes  received 17,461 bytes  35,126.00 bytes/sec
total size is 209,945  speedup is 11.95
receiving incremental file list
./
checkpoint_000024/
checkpoint_000024/.is_checkpoint
checkpoint_000024/.tune_metadata
checkpoint_000024/checkpoint.pth

sent 103 bytes  received 17,555 bytes  35,316.00 bytes/sec
total size is 228,688  speedup is 12.95
receiving incremental file list
./
checkpoint_000026/
checkpoint_000026/.is_checkpoint
checkpoint_000026/.tune_metadata
checkpoint_000026/checkpoint.pth

sent 104 bytes  received 17,659 bytes  35,526.00 bytes/sec
total size is 247,431  speedup is 13.93
receiving incremental file list
./
checkpoint_000028/
checkpoint_000028/.is_checkpoint
checkpoint_000028/.tune_metadata
checkpoint_000028/checkpoint.pth

sent 105 bytes  received 17,743 bytes  11,898.67 bytes/sec
total size is 266,174  speedup is 14.91
receiving incremental file list
./
checkpoint_000030/
checkpoint_000030/.is_checkpoint
checkpoint_000030/.tune_metadata
checkpoint_000030/checkpoint.pth

sent 106 bytes  received 17,878 bytes  35,968.00 bytes/sec
total size is 284,917  speedup is 15.84
receiving incremental file list
./
checkpoint_000032/
checkpoint_000032/.is_checkpoint
checkpoint_000032/.tune_metadata
checkpoint_000032/checkpoint.pth

sent 107 bytes  received 17,980 bytes  36,174.00 bytes/sec
total size is 303,660  speedup is 16.79
receiving incremental file list
./
checkpoint_000034/
checkpoint_000034/.is_checkpoint
checkpoint_000034/.tune_metadata
checkpoint_000034/checkpoint.pth

sent 108 bytes  received 18,081 bytes  12,126.00 bytes/sec
total size is 322,403  speedup is 17.73
receiving incremental file list
./
checkpoint_000036/
checkpoint_000036/.is_checkpoint
checkpoint_000036/.tune_metadata
checkpoint_000036/checkpoint.pth

sent 109 bytes  received 18,189 bytes  36,596.00 bytes/sec
total size is 341,146  speedup is 18.64
receiving incremental file list
./
checkpoint_000038/
checkpoint_000038/.is_checkpoint
checkpoint_000038/.tune_metadata
checkpoint_000038/checkpoint.pth

sent 110 bytes  received 18,293 bytes  36,806.00 bytes/sec
total size is 359,889  speedup is 19.56
receiving incremental file list
./
checkpoint_000040/
checkpoint_000040/.is_checkpoint
checkpoint_000040/.tune_metadata
checkpoint_000040/checkpoint.pth

sent 111 bytes  received 18,405 bytes  12,344.00 bytes/sec
total size is 378,632  speedup is 20.45
receiving incremental file list
./
checkpoint_000042/
checkpoint_000042/.is_checkpoint
checkpoint_000042/.tune_metadata
checkpoint_000042/checkpoint.pth

sent 112 bytes  received 18,513 bytes  12,416.67 bytes/sec
total size is 397,375  speedup is 21.34
receiving incremental file list
./
checkpoint_000044/
checkpoint_000044/.is_checkpoint
checkpoint_000044/.tune_metadata
checkpoint_000044/checkpoint.pth

sent 113 bytes  received 18,616 bytes  37,458.00 bytes/sec
total size is 416,118  speedup is 22.22
receiving incremental file list
./
checkpoint_000046/
checkpoint_000046/.is_checkpoint
checkpoint_000046/.tune_metadata
checkpoint_000046/checkpoint.pth

sent 114 bytes  received 18,731 bytes  37,690.00 bytes/sec
total size is 434,861  speedup is 23.08
receiving incremental file list
./
checkpoint_000048/
checkpoint_000048/.is_checkpoint
checkpoint_000048/.tune_metadata
checkpoint_000048/checkpoint.pth

sent 115 bytes  received 18,836 bytes  12,634.00 bytes/sec
total size is 453,604  speedup is 23.94
receiving incremental file list
./
checkpoint_000050/
checkpoint_000050/.is_checkpoint
checkpoint_000050/.tune_metadata
checkpoint_000050/checkpoint.pth

sent 116 bytes  received 18,930 bytes  12,697.33 bytes/sec
total size is 472,347  speedup is 24.80
receiving incremental file list
./
checkpoint_000052/
checkpoint_000052/.is_checkpoint
checkpoint_000052/.tune_metadata
checkpoint_000052/checkpoint.pth

sent 117 bytes  received 19,042 bytes  12,772.67 bytes/sec
total size is 491,090  speedup is 25.63
receiving incremental file list
./
checkpoint_000054/
checkpoint_000054/.is_checkpoint
checkpoint_000054/.tune_metadata
checkpoint_000054/checkpoint.pth

sent 118 bytes  received 19,155 bytes  38,546.00 bytes/sec
total size is 509,833  speedup is 26.45
receiving incremental file list
./
checkpoint_000056/
checkpoint_000056/.is_checkpoint
checkpoint_000056/.tune_metadata
checkpoint_000056/checkpoint.pth

sent 119 bytes  received 19,235 bytes  38,708.00 bytes/sec
total size is 528,576  speedup is 27.31
receiving incremental file list
./
checkpoint_000058/
checkpoint_000058/.is_checkpoint
checkpoint_000058/.tune_metadata
checkpoint_000058/checkpoint.pth

sent 120 bytes  received 19,365 bytes  38,970.00 bytes/sec
total size is 547,319  speedup is 28.09
receiving incremental file list
./
checkpoint_000060/
checkpoint_000060/.is_checkpoint
checkpoint_000060/.tune_metadata
checkpoint_000060/checkpoint.pth

sent 121 bytes  received 19,431 bytes  13,034.67 bytes/sec
total size is 566,062  speedup is 28.95
receiving incremental file list
./
checkpoint_000062/
checkpoint_000062/.is_checkpoint
checkpoint_000062/.tune_metadata
checkpoint_000062/checkpoint.pth

sent 122 bytes  received 19,532 bytes  13,102.67 bytes/sec
total size is 584,805  speedup is 29.76
receiving incremental file list
./
checkpoint_000064/
checkpoint_000064/.is_checkpoint
checkpoint_000064/.tune_metadata
checkpoint_000064/checkpoint.pth

sent 123 bytes  received 19,688 bytes  13,207.33 bytes/sec
total size is 603,548  speedup is 30.47
receiving incremental file list
./
checkpoint_000066/
checkpoint_000066/.is_checkpoint
checkpoint_000066/.tune_metadata
checkpoint_000066/checkpoint.pth

sent 124 bytes  received 19,770 bytes  39,788.00 bytes/sec
total size is 622,291  speedup is 31.28
receiving incremental file list
./
checkpoint_000068/
checkpoint_000068/.is_checkpoint
checkpoint_000068/.tune_metadata
checkpoint_000068/checkpoint.pth

sent 125 bytes  received 19,865 bytes  39,980.00 bytes/sec
total size is 641,034  speedup is 32.07
receiving incremental file list
./
checkpoint_000070/
checkpoint_000070/.is_checkpoint
checkpoint_000070/.tune_metadata
checkpoint_000070/checkpoint.pth

sent 126 bytes  received 19,977 bytes  40,206.00 bytes/sec
total size is 659,777  speedup is 32.82
receiving incremental file list
./
checkpoint_000072/
checkpoint_000072/.is_checkpoint
checkpoint_000072/.tune_metadata
checkpoint_000072/checkpoint.pth

sent 127 bytes  received 20,108 bytes  40,470.00 bytes/sec
total size is 678,520  speedup is 33.53
receiving incremental file list
./
checkpoint_000074/
checkpoint_000074/.is_checkpoint
checkpoint_000074/.tune_metadata
checkpoint_000074/checkpoint.pth

sent 128 bytes  received 20,207 bytes  40,670.00 bytes/sec
total size is 697,263  speedup is 34.29
receiving incremental file list
./
checkpoint_000076/
checkpoint_000076/.is_checkpoint
checkpoint_000076/.tune_metadata
checkpoint_000076/checkpoint.pth

sent 129 bytes  received 20,294 bytes  40,846.00 bytes/sec
total size is 716,006  speedup is 35.06
receiving incremental file list
./
checkpoint_000078/
checkpoint_000078/.is_checkpoint
checkpoint_000078/.tune_metadata
checkpoint_000078/checkpoint.pth

sent 130 bytes  received 20,421 bytes  41,102.00 bytes/sec
total size is 734,749  speedup is 35.75
receiving incremental file list
./
checkpoint_000080/
checkpoint_000080/.is_checkpoint
checkpoint_000080/.tune_metadata
checkpoint_000080/checkpoint.pth

sent 131 bytes  received 20,486 bytes  41,234.00 bytes/sec
total size is 753,492  speedup is 36.55
receiving incremental file list
./
checkpoint_000082/
checkpoint_000082/.is_checkpoint
checkpoint_000082/.tune_metadata
checkpoint_000082/checkpoint.pth

sent 132 bytes  received 20,601 bytes  13,822.00 bytes/sec
total size is 772,235  speedup is 37.25
receiving incremental file list
./
checkpoint_000084/
checkpoint_000084/.is_checkpoint
checkpoint_000084/.tune_metadata
checkpoint_000084/checkpoint.pth

sent 133 bytes  received 20,748 bytes  13,920.67 bytes/sec
total size is 790,978  speedup is 37.88
receiving incremental file list
./
checkpoint_000086/
checkpoint_000086/.is_checkpoint
checkpoint_000086/.tune_metadata
checkpoint_000086/checkpoint.pth

sent 134 bytes  received 20,828 bytes  13,974.67 bytes/sec
total size is 809,721  speedup is 38.63
receiving incremental file list
./
checkpoint_000088/
checkpoint_000088/.is_checkpoint
checkpoint_000088/.tune_metadata
checkpoint_000088/checkpoint.pth

sent 135 bytes  received 20,934 bytes  42,138.00 bytes/sec
total size is 828,464  speedup is 39.32
receiving incremental file list
./
checkpoint_000090/
checkpoint_000090/.is_checkpoint
checkpoint_000090/.tune_metadata
checkpoint_000090/checkpoint.pth

sent 136 bytes  received 21,029 bytes  42,330.00 bytes/sec
total size is 847,207  speedup is 40.03
receiving incremental file list
./
checkpoint_000092/
checkpoint_000092/.is_checkpoint
checkpoint_000092/.tune_metadata
checkpoint_000092/checkpoint.pth

sent 137 bytes  received 21,131 bytes  42,536.00 bytes/sec
total size is 865,950  speedup is 40.72
receiving incremental file list
./
checkpoint_000094/
checkpoint_000094/.is_checkpoint
checkpoint_000094/.tune_metadata
checkpoint_000094/checkpoint.pth

sent 138 bytes  received 21,259 bytes  42,794.00 bytes/sec
total size is 884,693  speedup is 41.35
receiving incremental file list
./
checkpoint_000096/
checkpoint_000096/.is_checkpoint
checkpoint_000096/.tune_metadata
checkpoint_000096/checkpoint.pth

sent 139 bytes  received 21,345 bytes  14,322.67 bytes/sec
total size is 903,436  speedup is 42.05
receiving incremental file list
./
checkpoint_000098/
checkpoint_000098/.is_checkpoint
checkpoint_000098/.tune_metadata
checkpoint_000098/checkpoint.pth

sent 140 bytes  received 21,450 bytes  14,393.33 bytes/sec
total size is 922,179  speedup is 42.71 

Is this the file you were referring to? If so: I set 100 iterations per trial, saving a checkpoint every 2 iterations, and the log indeed shows that checkpoint_000100 is missing.
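
To double-check which checkpoint folders actually made it to the head node, something like this quick script could be used (a throwaway sketch; the trial directory is the one from the error message above):

import os

# Quick check on the head node: list the checkpoint folders for the trial
# from the error message above. Based on the sync log, the newest folder
# should be checkpoint_000098, with checkpoint_000100 missing.
trial_dir = ("/home/ubuntu/ray_results/exp/"
             "train_mnist_75714_00001_1_lr=0.00026945,"
             "momentum=0.36098_2021-04-06_09-50-03")
checkpoints = sorted(d for d in os.listdir(trial_dir)
                     if d.startswith("checkpoint_"))
print(checkpoints)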

OK I see. Could you actually post the code that you’ve written?

I think there might be something odd with setting checkpoint_at_end.

Hi @marta_hum

With a manually set up cluster, I also see that the last checkpoint is missing.

Your setup saves intermediate checkpoints; maybe that is what I should try.

May I ask which version of Ray you are running?

Hi @rliaw, the code I've written is the one I posted at the beginning of this thread.

I am not setting anything for checkpoint_at_end, so I suppose it uses the default for this argument. If I understand it correctly, checkpoint_at_end does not have any effect with the functional training API, so it should not impact my case, should it?
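
In other words, my understanding is that this flag is just passed to tune.run and (if I read the docs correctly) only matters for class-based Trainables, e.g.:

# As far as I understand, checkpoint_at_end is a tune.run argument that only
# applies to class-based Trainables; with the function API used by train_mnist
# above it should be a no-op at its default of False.
analysis = tune.run(
    train_mnist,
    metric="mean_accuracy",
    mode="max",
    name="exp",
    checkpoint_at_end=False,  # the default
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    })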

Hi @RickLan,

I am using Ray v2.0.0.dev0

I understand you are facing a problem similar to mine? I do manage to save intermediate checkpoints, but the last one fails to sync.
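
As a stopgap, I am thinking of checkpointing every iteration instead of every other one, so that at most one iteration of progress is lost if the final sync fails. A sketch of what I mean, assuming tune.with_parameters can forward the extra save_every_n argument of my train_mnist function:

from ray import tune

# Stopgap sketch: checkpoint on every iteration (save_every_n=1) so that,
# even if the very last checkpoint fails to sync, the newest surviving
# checkpoint is only one iteration behind.
analysis = tune.run(
    tune.with_parameters(train_mnist, save_every_n=1),
    metric="mean_accuracy",
    mode="max",
    name="exp",
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    })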
