Tmp Folder Filling up with trials

I have a storage server where I can store many TB of the model files for Ray Tune to pause and use.
Currently it is filling up my HPC server, which has a small drive, in the \AppData\Local\Temp\2 folder on Windows. Ray Tune is creating a bunch of folders named “tmpxxxxxxxx”, each with a model.keras file and a metadata file inside.

I am using v2.8.

I am already changing the temp dir and the current dir; I'm mainly wondering what other paths I need to change to fix this!

    import ray
    from ray import train, tune
    from ray.tune.schedulers import HyperBandForBOHB
    from ray.tune.search.bohb import TuneBOHB

    # gendata_lstm (data generation) and train_mnist (the trainable) are
    # defined elsewhere and not shown here.

    # Point Ray's own session/temp files at the large drive.
    ray.init(configure_logging=False, _temp_dir="Z:\\LSTM\\HYPER\\TMP")

    # BOHB search, limited to one concurrent trial.
    algo = TuneBOHB()
    algo = tune.search.ConcurrencyLimiter(algo, max_concurrent=1)
    scheduler = HyperBandForBOHB(
        time_attr="training_iteration",
        max_t=6000,
        stop_last_trials=False,
    )

    X_train, X_test, y_train, y_test = gendata_lstm.GenerateData.GenerateData(120)
    data = {"X_t": X_train, "X_tt": X_test, "y_t": y_train, "y_tt": y_test}
    trainable_with_resources = tune.with_resources(train_mnist, {"gpu": 2})

    tuner = tune.Tuner(
        tune.with_parameters(trainable_with_resources, data=data),
        tune_config=tune.TuneConfig(
            metric="val_loss",
            mode="min",
            search_alg=algo,
            scheduler=scheduler,
            num_samples=1000,
        ),
        run_config=train.RunConfig(
            name="LSMT_4L",
            # Both paths point at the storage server.
            storage_path="\\\\192.168.1.107\\StorageDrive\\LSTM\\HYPER",
            local_dir="\\\\192.168.1.107\\StorageDrive\\LSTM\\HYPER\\TMP",
            stop={"mean_accuracy": 0.99},
        ),
        param_space={
            "lr": tune.uniform(0.000001, 0.1),
            "l1": tune.randint(512, 2048),
            "l2": tune.randint(512, 2048),
            "l3": tune.randint(512, 2048),
            "l4": tune.randint(512, 2048),
            "decay": tune.uniform(1e-5, 1e-2),
        },
    )

I found the cause in the code: the temp directory is actually hard-coded in TensorflowCheckpoint.
Please let me know if you can change this. I had to change the system env variable to work around it.

@Diego_Rukoz I bumped into exactly the same thing in a Windows environment, and I wrote a short script to clean up the folder from time to time. It is annoying, but for the time being it does not stop me from continuing with my training.
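For anyone who wants a starting point, here is a rough sketch of what such a cleanup script could look like. It is not the script mentioned above. The function name `clean_stale_checkpoint_dirs`, the one-hour age threshold, and the rule of only touching "tmp*" folders that contain a model.keras file are all assumptions based on the folder layout described in the first post.

```python
import shutil
import time
from pathlib import Path


def clean_stale_checkpoint_dirs(temp_root, max_age_seconds=3600, dry_run=False):
    """Remove tmp* folders under temp_root that contain a model.keras file
    and have not been modified for at least max_age_seconds.

    Returns the list of folders that were (or, with dry_run, would be) removed.
    """
    removed = []
    now = time.time()
    for entry in Path(temp_root).glob("tmp*"):
        if not entry.is_dir():
            continue
        # Assumption: only folders holding model.keras are checkpoint dumps.
        if not (entry / "model.keras").exists():
            continue
        # Skip recent folders, since a running trial may still be using them.
        if now - entry.stat().st_mtime < max_age_seconds:
            continue
        if not dry_run:
            shutil.rmtree(entry, ignore_errors=True)
        removed.append(entry)
    return removed
```

Calling it with `dry_run=True` first, for example `clean_stale_checkpoint_dirs(tempfile.gettempdir(), dry_run=True)`, lists what would be deleted without touching anything.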

If you can share your script, that would be helpful. I opened a bug, and it was resolved and merged into the nightly build. Sadly, I tried the fix they added and it was not good enough for the storage issue I’m facing on the HPC server. I wish they would just let us set the directory for the temp files. I read the code and it’s just an env variable, but I don’t want to change it system-wide.

This is the fix PR that was mentioned: “[train] Update TensorFlow ReportCheckpointCallback to delete temporary directory” by matthewdeng (Pull Request #41033, ray-project/ray on GitHub).

Would being able to configure the TensorflowCheckpoint temp directory as an argument in its constructor be a suitable fix?

Another workaround would be to subclass ReportCheckpointCallback and TensorflowCheckpoint and override the necessary methods to suit your needs!

Yes, being able to configure the tmp directory through a variable would be awesome!

On Windows, setting the runtime env variables TEMP, TMP and TMPDIR to the target temp directory seems to override %localappdata%\Temp and force the checkpoint tmp location to the directory set in those variables.

This works because “tempfile.mkdtemp()” in the TensorflowCheckpoint definition picks its base directory from those variables, so it uses the target directory as the Windows temporary directory.
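A minimal sketch of the same idea applied per process instead of system-wide. The helper names `temp_env_vars` and `redirect_tempfile` are made up for illustration. It relies on the documented behaviour of Python's tempfile module, which checks TMPDIR, TEMP and TMP (in that order) and caches the result in `tempfile.tempdir`.

```python
import os
import tempfile


def temp_env_vars(target_dir):
    """Build a TMPDIR/TEMP/TMP mapping that points at target_dir."""
    target = os.path.abspath(target_dir)
    return {name: target for name in ("TMPDIR", "TEMP", "TMP")}


def redirect_tempfile(target_dir):
    """Make tempfile in the current process use target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    os.environ.update(temp_env_vars(target_dir))
    # tempfile caches its first lookup; clear it so the env is read again.
    tempfile.tempdir = None
    return tempfile.gettempdir()
```

This only affects the process that calls it. Ray trials run in separate worker processes, so the variables also have to reach those workers. One possible way, which is an assumption and has not been checked against this setup, is to pass the mapping at startup with `ray.init(runtime_env={"env_vars": temp_env_vars("Z:\\LSTM\\HYPER\\TMP")})`.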

This is a great tip; thanks @ninja0n3