Make Ray Tune not write files

someuser · May 10, 2021, 1:16pm

Hello,

Ray Tune writes hundreds of thousands of files, and that’s just too much for me. I’m not interested in those files; for now I write my own output using the analysis object.

How can I turn file writing off? Can I use ASHAScheduler without it writing checkpoints?

kai · May 10, 2021, 2:34pm

Hi,

can you elaborate a bit on your problem? Which files are written? What does your training code look like?
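A quick way to answer the "which files" question is to count the files under the results directory by type. This is only a minimal sketch using the standard library; `count_files` is a made-up helper name, and `~/ray_results` is an assumption about where the results are stored, so adjust the path if they go somewhere else.

```python
import os
from collections import Counter


def count_files(root):
    """Count files under `root`, grouped by extension.

    Files without an extension (e.g. a file literally named "checkpoint")
    are grouped under their own file name.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            key = os.path.splitext(name)[1] or name
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Assumed location of the Tune results; change as needed.
    results_dir = os.path.expanduser("~/ray_results")
    for key, n in count_files(results_dir).most_common(10):
        print(f"{n:>8}  {key}")
```

If most of the count comes from checkpoint files, the checkpointing in the training function is the likely culprit; if it comes from per-trial log files, the number of trials is.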

Hundreds of thousands of files is not normal behavior, so chances are that there is a configuration error. For instance, if you set the checkpoint frequency to every iteration and run a trial for thousands of iterations, you end up with thousands of checkpoints per trial. In that case you should drastically decrease the checkpoint frequency.
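To make the arithmetic concrete, here is a small sketch in plain Python (no Ray involved). `checkpoint_steps` and its `every` parameter are hypothetical names for illustration, not part of the Tune API: the idea is simply to write a checkpoint on every N-th iteration and on the last one.

```python
def checkpoint_steps(num_iterations, every):
    """Return the iterations at which a checkpoint would be written.

    Writes on every `every`-th iteration, and always on the last
    iteration so the final state is kept.
    """
    if every < 1:
        raise ValueError("every must be >= 1")
    last = num_iterations - 1
    return [t for t in range(num_iterations)
            if (t + 1) % every == 0 or t == last]


# One trial, 1000 iterations:
print(len(checkpoint_steps(1000, every=1)))    # 1000 checkpoints
print(len(checkpoint_steps(1000, every=100)))  # 10 checkpoints
```

Inside a training loop, the same condition (`(t + 1) % every == 0 or t == last`) can guard whatever code writes the checkpoint. Multiply the per-trial count by the number of trials to estimate the total.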

someuser · May 10, 2021, 3:11pm

First let me ask this: Does ASHAScheduler need checkpoints or could I just get rid of that checkpoint saving? I do save a ton of checkpoints apparently.

I really just want it to not write any files. Maybe you just want to test something, debug something, or whatever. There must be such an option, no? Even if checkpointing is implemented.

I’m talking about the res_results directory.

Honestly, if it has to write one file per checkpoint or whatever then 100k doesn’t sound like much to me. But it’s too much for me since I just want to try out stuff and I can’t allow that kind of file system usage.

def main(config, checkpoint_dir='checkpoints'):
    debug = False

    if checkpoint_dir:
        with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
            state = json.loads(f.read())
            start = state["step"] + 1

    # Parameters
    learning_rate = config["learning_rate"]
    batch_size = config["batch_size"]
    momentum = config["momentum"]
    epochs = 100

    # Read train data
    mutations_dataset = mutationsDataset(
        csv_file='/cluster/home/user/iml21/task_3/data/train.csv')

    # Get dataloaders
    if debug == False:
        train_dataloader, evaluation_dataloader = getDataloaders(
                mutations_dataset, batch_size)
    else:
        epochs = 1000
        # Choose a small subset
        train_data = torch.utils.data.Subset(train_data, range(100))
        train_dataloader = torch.utils.data.DataLoader(train_data)
        evaluation_dataloader = train_dataloader

    # Get model
    model = NeuralNetwork(config["l1"], config["l2"], config["dropout1"], config["dropout2"]).to(device)

    # Get loss function
    loss_fn = nn.BCEWithLogitsLoss()

    # Get optimizer
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, betas=(0.9, 0.999))
    #optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

    # Epochs
    for t in range(epochs):
        print(f"\nEpoch {t+1}")
        train(config, train_dataloader, model, loss_fn, optimizer, debug)
        score = evaluation(config, evaluation_dataloader, model, debug)

        # Obtain a checkpoint directory
        with tune.checkpoint_dir(step=t) as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            with open(path, "w") as f:
                f.write(json.dumps({"step": t}))

        tune.report(score=score)

    print("Done!")

and

asha_scheduler = ASHAScheduler(
    time_attr='training_iteration',
    metric='score',
    mode='max',
    max_t=100,
    grace_period=10,
    reduction_factor=3,
    brackets=1)

analysis = tune.run(
    main,
    config=params,
    num_samples=20,
    scheduler=asha_scheduler
)

Oh, and sure, I can point it towards another folder, but the maximum file quota I have is 1 million, and if I start a bunch of jobs with a lot of possible parameters, it looks like Ray Tune would go crazy and easily fill that 1 million quota.

I’m probably doing something wrong, and I’m happy to learn what, but as I said, the main question remains: Can I turn off the file writing?

ggerog · October 3, 2022, 3:22pm

I am also interested in knowing how to turn off all checkpointing. I tried checkpoint_freq=0 in tune.run, but it doesn’t seem to work.

Lars_Simon_Zehnder · October 10, 2022, 12:40pm

Hi @someuser @ggerog does this thread help?

Lars_Simon_Zehnder · October 10, 2022, 5:54pm

@ggerog checkpoint_freq would not work in my opinion as the argument is named checkpoint_frequency.
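Independent of the argument name: as far as I understand, for a function trainable like the one posted at the top of this thread, the checkpoints are written by the `tune.checkpoint_dir(...)` block inside the function itself, so the simplest way to get none is to leave that block out. ASHA, as I understand it, only decides whether a trial continues or stops, so it should not need checkpoints. Below is a minimal sketch assuming the Ray 1.x function API (`tune.report`, `tune.run`) used in the original post. `fake_scores`, `run_without_checkpoints`, and the one-parameter search space are placeholders I made up for illustration.

```python
import random


def fake_scores(config, epochs):
    """Stand-in for the real train/evaluate loop: yields one score per epoch."""
    rng = random.Random(0)
    score = 0.0
    for _ in range(epochs):
        score += config["learning_rate"] * rng.random()
        yield score


def run_without_checkpoints(num_samples=20):
    # Imported here so the helper above can be used without Ray installed.
    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    def trainable(config):
        # No checkpoint_dir argument and no tune.checkpoint_dir(...) block:
        # the function itself never writes a checkpoint.
        for score in fake_scores(config, epochs=100):
            tune.report(score=score)

    scheduler = ASHAScheduler(
        time_attr="training_iteration", metric="score", mode="max",
        max_t=100, grace_period=10, reduction_factor=3, brackets=1)

    return tune.run(
        trainable,
        config={"learning_rate": tune.loguniform(1e-4, 1e-1)},
        num_samples=num_samples,
        scheduler=scheduler,
    )
```

Note that this only removes the checkpoint files. Tune will still write its per-trial result and log files into the results directory.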

ggerog · October 11, 2022, 6:37am

I get the following if I set that to zero:

And if I run with checkpoint_freq=0 I still get checkpoints, and it is impossible to set keep_checkpoints_num=0 without getting an error:

RuntimeError: If checkpointing is enabled, Ray Tune requires keep_checkpoints_num to be None or a number greater than 0

Lars_Simon_Zehnder · October 14, 2022, 7:32am

@ggerog What Ray version are you using?

ndvbd · May 17, 2023, 5:49am

Same issue here. Ray Tune wrote 50GB into some /home/ray_results folder which killed my server. How can I make all HD writing stop?
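A partial workaround, sketched under assumptions: the `TUNE_DISABLE_AUTO_CALLBACK_LOGGERS` environment variable is documented to stop Tune from adding its default logger callbacks, and in Ray 1.x-style `tune.run` the `local_dir` argument controls where results go. Neither stops every write, so the sketch below also points the run at a throwaway directory and deletes it afterwards. `run_in_scratch_dir` is a hypothetical helper, not a Ray function.

```python
import os
import shutil
import tempfile

# Must be set before Tune starts. Asks Tune not to add its default
# logger callbacks (this also limits later result analysis from disk).
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"


def run_in_scratch_dir(run_fn):
    """Call run_fn(local_dir) with a throwaway directory, then delete it.

    run_fn must return everything it needs (e.g. the best config),
    because the directory is gone once this function returns.
    """
    scratch = tempfile.mkdtemp(prefix="tune_scratch_")
    try:
        return run_fn(scratch)
    finally:
        shutil.rmtree(scratch, ignore_errors=True)


# Hypothetical usage with Ray 1.x (main and params as in the first post):
#   best = run_in_scratch_dir(
#       lambda d: tune.run(main, config=params, local_dir=d)
#                     .get_best_config(metric="score", mode="max"))
```

The design choice here is to accept that some files get written, but to keep them out of the home directory and remove them as soon as the results have been extracted.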
