Ray Tune log grows extremely large

I am running a simple modification of the Ray Tune Quick Start example from the docs (Tune: Scalable Hyperparameter Tuning — Ray v1.2.0),
taking a uniform distribution for alpha and making it run indefinitely:

from ray import tune


def objective(step, alpha, beta):
    return (0.1 + alpha * step / 100)**(-1) + beta * 0.1


def training_function(config):
    # Hyperparameters
    alpha, beta = config["alpha"], config["beta"]
    for step in range(10):
        # Iterative training function - can be any arbitrary training procedure.
        intermediate_score = objective(step, alpha, beta)
        # Feed the score back to Tune.
        tune.report(mean_loss=intermediate_score)


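# num_samples=-1 tells Tune to keep generating new trial samples indefinitely.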
analysis = tune.run(
    training_function, num_samples=-1,
    config={
        "alpha": tune.uniform(0.001, 0.1),
        "beta": tune.choice([1, 2, 3])
    })

print("Best config: ", analysis.get_best_config(
    metric="mean_loss", mode="min"))

# Get a dataframe for analyzing trial results.
df = analysis.results_df

When I let the tuning run, I notice that the log directory /tmp/ray/session_latest/logs grows extremely large. In particular, the file gcs_server.out grows to roughly 100 MB within one minute and contains just a repeating series of the following messages:

[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:292: Registering placement group, placement group id = 8c725267c9a2384dbcd107adc450d63c, name = __tune_a640a6de__6faa9d03, strategy = 0
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:296: Finished registering placement group, placement group id = 8c725267c9a2384dbcd107adc450d63c, name = __tune_a640a6de__6faa9d03, strategy = 0
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_scheduler.cc:141: Scheduling placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, bundles size = 1
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_scheduler.cc:150: Failed to schedule placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, because no nodes are available.
[2021-04-30 15:07:58,109 I 16431 16431] gcs_placement_group_manager.cc:215: Failed to create placement group __tune_a640a6de__a25b2634, id: 8d3433fc1f7eb1df3fb8cde4757b0e8a, try again.

After running the tuning for a couple of hours, my disk has no space left due to this one file alone.

Is there a way to a) disable the logging or b) fix the problem reported in the logs?

I’m running the tuning with Ray 1.3.0 installed from pip on Ubuntu 16.04.
Please let me know what other information I should provide to help reproduce the problem.

Yeah, this is an issue that was also reported on Github: [tune] Logs fill up disk space causing a "No space left on device" error · Issue #15595 · ray-project/ray · GitHub

We’ll be taking a look at this soon. In the meantime, perhaps consider just setting TUNE_MAX_PENDING_TRIALS_PG=1 as a workaround?
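For example, something like this (a minimal sketch - the variable just needs to be set before tune.run() is called, and doing it via os.environ at the top of the script is one way, assumed here):

import os

# Workaround: cap the number of trials Tune keeps in the PENDING state,
# which cuts down the repeated placement group scheduling attempts seen in the logs.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"

from ray import tune

Equivalently, you can export the variable in the shell before launching your script, e.g. TUNE_MAX_PENDING_TRIALS_PG=1 python your_tune_script.py (the script name is a placeholder).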

Thanks, that does reduce the log size significantly! However, it also seems to limit the parallel execution of the trials, so I set TUNE_MAX_PENDING_TRIALS_PG to the number of CPUs used (which I also pass to ray.init).
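Concretely, my setup looks roughly like this (a sketch - the CPU count of 8 is just a placeholder for my machine):

import os

num_cpus = 8  # placeholder: the number of CPUs I want Tune to use

# Allow as many pending trials as CPUs so that parallelism is not throttled.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = str(num_cpus)

import ray
from ray import tune

ray.init(num_cpus=num_cpus)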

How exactly is TUNE_MAX_PENDING_TRIALS_PG related to parallel trial execution? I could not understand it from the documentation.

Hi Daniel,

TUNE_MAX_PENDING_TRIALS_PG limits the number of trials that can be in the PENDING state at once. Tune should still be able to leverage the full resources of the cluster, but it might take a little longer to start new trials. In most cases, and especially on a single machine, this delay should only be a couple of seconds - does it prevent parallel execution for you completely?

Setting it to the number of available CPUs is definitely a good choice though.


BTW @sangcho would it be easy to reduce the amount of logging that is done?

Yeah, it should be easy to do, and I will work on it next week.

Is it the maintainers’ understanding that, given that the linked PR has been closed, this issue has been fixed in more recent releases and/or on GitHub? I’m asking because someone in the group I work with is currently experiencing this issue, much to the misfortune of our SSD’s remaining space, and I’m wondering whether the current best solution is still to limit TUNE_MAX_PENDING_TRIALS_PG.

Hey @Cody_Wild, great to see you here! This should be fixed in 1.4.1 (the latest Ray release).

If you cannot upgrade, limiting the env var would be best.