Ray Tune log grows extremely large

I am running a simple modification of the Ray Tune Quick Start example from the docs (Tune: Scalable Hyperparameter Tuning — Ray v1.2.0),
taking a uniform distribution for alpha and making it run indefinitely:

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)


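# num_samples=-1 tells Tune to keep generating new trial samples indefinitely.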
analysis = tune.run(
    training_function, num_samples=-1,
    config={
        "alpha": tune.uniform(0.001, 0.1),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(
    metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df

When I let the tuning run, I notice that the log directory /tmp/ray/session_latest/logs grows extremely large. In particular, the file gcs_server.out grows to roughly 100 MB within one minute and contains just a repeating series of the following messages:

[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:292: Registering placement group, placement group id = 8c725267c9a2384dbcd107adc450d63c, name = __tune_a640a6de__6faa9d03, strategy = 0
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:296: Finished registering placement group, placement group id = 8c725267c9a2384dbcd107adc450d63c, name = __tune_a640a6de__6faa9d03, strategy = 0
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_scheduler.cc:141: Scheduling placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, bundles size = 1
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_scheduler.cc:150: Failed to schedule placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, because no nodes are available.
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:215: Failed to create placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, try again.

After running the tuning for a couple of hours, my disk has no space left due to this one file alone.

Is there a way to a) disable the logging or b) fix the problem reported in the logs?

I’m running the tuning with Ray 1.3.0 installed from pip on Ubuntu 16.04.
Please let me know what other information I should provide to help reproduce the problem.

Yeah, this is an issue that was also reported on Github: [tune] Logs fill up disk space causing a "No space left on device" error · Issue #15595 · ray-project/ray · GitHub

We’ll be taking a look at this soon. In the meantime, perhaps consider just setting TUNE_MAX_PENDING_TRIALS_PG=1 as a workaround?
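For example, something like this (a minimal sketch - the variable just needs to be set before tune.run() is called, and doing it via os.environ at the top of the script is one way, assumed here):

import os

# Workaround: cap the number of trials Tune keeps in the PENDING state,
# which cuts down the repeated placement group scheduling attempts seen in the logs.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"

from ray import tune

Equivalently, you can export the variable in the shell before launching your script, e.g. TUNE_MAX_PENDING_TRIALS_PG=1 python your_tune_script.py (the script name is a placeholder).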

Thanks, that does reduce the log size significantly! However, it also seems to limit the parallel execution of the trials, so I set TUNE_MAX_PENDING_TRIALS_PG to the number of CPUs used (which I also pass to ray.init).
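Concretely, my setup looks roughly like this (a sketch - the CPU count of 8 is just a placeholder for my machine):

import os

num_cpus = 8  # placeholder: the number of CPUs I want Tune to use

# Allow as many pending trials as CPUs so that parallelism is not throttled.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = str(num_cpus)

import ray
from ray import tune

ray.init(num_cpus=num_cpus)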

How exactly is TUNE_MAX_PENDING_TRIALS_PG related to parallel trial execution? I could not understand it from the documentation.

Hi Daniel,

TUNE_MAX_PENDING_TRIALS_PG limits the number of trials that can be in the PENDING state at once. Tune should still be able to leverage the full resources of the cluster, but it might take a little longer to start new trials. In most cases, and especially on a single machine, this delay should only be a couple of seconds - does it prevent parallel execution for you completely?

Setting it to the number of available CPUs is definitely a good choice though.


BTW @sangcho would it be easy to reduce the amount of logging that is done?

Yeah, it should be easy to do, and I will work on it next week.

Is it the maintainers’ understanding that, given that the linked PR has been closed, this issue has been fixed in more recent releases and/or on GitHub? I’m asking because someone in the group I work with is currently experiencing this issue, much to the misfortune of our SSD’s remaining space, and I’m wondering whether the current best solution is still to limit TUNE_MAX_PENDING_TRIALS_PG.

Hey @Cody_Wild, great to see you here! This should be fixed in 1.4.1 (the latest Ray release).

If you cannot upgrade, limiting the env var would be best.