Make Ray Tune not write files

Hello,

so Ray Tune writes hundreds of thousands of files, and that’s just too much for me. I’m not interested in those files; for now I write my own output using the analysis object.

How can I turn file writing off? Can I use ASHAScheduler without it writing checkpoints?

Hi,

can you elaborate a bit on your problem? Which files are being written? What does your training code look like?

Hundreds of thousands of files is not expected behavior, so chances are there is a configuration issue. For instance, if you checkpoint on every iteration and run a trial for thousands of iterations, you end up with thousands of checkpoints. In that case you should drastically decrease the checkpoint writing frequency.
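
If you are on the function API, that frequency is controlled by your own training function: you simply don’t enter tune.checkpoint_dir on every step. A minimal sketch, assuming the function-based trainable API (tune.checkpoint_dir / tune.report); my_trainable and the dummy metric are placeholders:

import json
import os

from ray import tune

def my_trainable(config, checkpoint_dir=None):
    for step in range(1000):
        score = step * config["lr"]  # dummy metric, replace with real training
        # Write a checkpoint only every 10th step instead of on every step
        if step % 10 == 0:
            with tune.checkpoint_dir(step=step) as ckpt_dir:
                with open(os.path.join(ckpt_dir, "checkpoint"), "w") as f:
                    json.dump({"step": step}, f)
        tune.report(score=score)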

First let me ask this: does ASHAScheduler need checkpoints, or could I just get rid of the checkpoint saving? Apparently I do save a ton of checkpoints.

I really just want it to not write any files. Sometimes you just want to test or debug something, so surely there must be an option for that, even with checkpointing implemented?

I’m talking about the ray_results directory.

Honestly, if it has to write one file per checkpoint, then 100k files doesn’t sound surprising. But it’s too much for me, since I just want to try things out and I can’t afford that kind of file system usage.

import json
import os

import torch
from torch import nn
from ray import tune

# mutationsDataset, getDataloaders, NeuralNetwork, train and evaluation
# are my own helpers defined elsewhere in the script.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def main(config, checkpoint_dir=None):
    debug = False

    # Restore the last completed step if Tune hands us a checkpoint directory
    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
            state = json.loads(f.read())
            start = state["step"] + 1

    # Parameters
    learning_rate = config["learning_rate"]
    batch_size = config["batch_size"]
    momentum = config["momentum"]
    epochs = 100

    # Read train data
    mutations_dataset = mutationsDataset(
        csv_file='/cluster/home/user/iml21/task_3/data/train.csv')

    # Get dataloaders
    if not debug:
        train_dataloader, evaluation_dataloader = getDataloaders(
            mutations_dataset, batch_size)
    else:
        epochs = 1000
        # Choose a small subset
        train_data = torch.utils.data.Subset(mutations_dataset, range(100))
        train_dataloader = torch.utils.data.DataLoader(train_data)
        evaluation_dataloader = train_dataloader

    # Get model
    model = NeuralNetwork(config["l1"], config["l2"],
                          config["dropout1"], config["dropout2"]).to(device)

    # Get loss function
    loss_fn = nn.BCEWithLogitsLoss()

    # Get optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
    # optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

    # Epochs
    for t in range(epochs):
        print(f"\nEpoch {t+1}")
        train(config, train_dataloader, model, loss_fn, optimizer, debug)
        score = evaluation(config, evaluation_dataloader, model, debug)

        # Obtain a checkpoint directory and write a checkpoint every epoch
        # (this is where all the files come from)
        with tune.checkpoint_dir(step=t) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            with open(path, "w") as f:
                f.write(json.dumps({"step": t}))

        tune.report(score=score)

    print("Done!")

and

asha_scheduler = ASHAScheduler(
    time_attr='training_iteration',
    metric='score',
    mode='max',
    max_t=100,
    grace_period=10,
    reduction_factor=3,
    brackets=1)

analysis = tune.run(
    main,
    config=params,
    num_samples=20,
    scheduler=asha_scheduler
)
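
(For completeness, params looks roughly like this; the keys match what my trainable reads, but the exact search ranges here are only illustrative:)

params = {
    "learning_rate": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
    "momentum": tune.uniform(0.1, 0.9),
    "l1": tune.choice([64, 128, 256]),
    "l2": tune.choice([32, 64, 128]),
    "dropout1": tune.uniform(0.1, 0.5),
    "dropout2": tune.uniform(0.1, 0.5),
}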

Oh, and sure, I can point it towards another folder, but my file quota is 1 million files at most, and if I start a bunch of jobs with a lot of possible parameters, it looks like Ray Tune would easily fill that 1M quota.
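
(By "point it towards another folder" I mean passing local_dir to tune.run; the scratch path below is just an example, not my real one:)

analysis = tune.run(
    main,
    config=params,
    num_samples=20,
    scheduler=asha_scheduler,
    local_dir="/cluster/scratch/user/ray_results",  # example path
)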

I’m probably doing something wrong, and I’m happy to learn what, but as I said, the main question remains: can I turn off the file writing?

I am also interested in knowing how to turn off all checkpointing. I tried checkpoint_freq=0 in tune.run, but it doesn’t seem to work.

Hi @someuser, @ggerog, does this thread help?

@ggerog I don’t think checkpoint_freq would work, as the argument is named checkpoint_frequency.

I get the following if I set that to zero:

And if I run with checkpoint_freq=0 I still get checkpoints, and it is impossible to set keep_checkpoints_num=0 without getting an error:

RuntimeError: If checkpointing is enabled, Ray Tune requires keep_checkpoints_num to be None or a number greater than 0
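
(This is roughly the kind of call that raises it for me; my_trainable and my_config stand in for my actual trainable and search space:)

analysis = tune.run(
    my_trainable,
    config=my_config,
    checkpoint_freq=0,       # checkpoints still show up
    keep_checkpoints_num=0,  # this triggers the RuntimeError above
)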

@ggerog What Ray version are you using?

Same issue here. Ray Tune wrote 50 GB into some /home/ray_results folder, which killed my server. How can I stop all writing to disk?