[Ray Tune] RSS memory leak of the driver process

Ray Tune memory leak: RSS grows linearly with trial count in tune.run()

Environment

  • Ray version: 2.55.1
  • Python: 3.10.13
  • OS: Debian GNU/Linux 12 (bookworm)

Description

When running a large number of trials via tune.run() in a single process (local mode, resources_per_trial={"cpu": 1}), the driver process RSS grows linearly with the number of completed trials, eventually leading to OOM.

Reproduction

"""
python test_ray_tune.py [--trials 4000] [--report-interval 100]
"""
import argparse, os, psutil, ray
from ray import tune
from ray.tune import Callback
from ray.tune.search.basic_variant import BasicVariantGenerator

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--trials", type=int, default=4000)
    p.add_argument("--report-interval", type=int, default=100)
    return p.parse_args()

class RssReporter(Callback):
    def __init__(self, interval, process=None):
        self.interval = interval
        self.process = process
        self.count = 0
    def on_trial_complete(self, iteration, trials, trial, **info):
        self.count += 1
        if self.count % self.interval: return
        rss = self.process.memory_info().rss if self.process else 0
        print(f"trial#{self.count}: RSS={rss / 1e9:.3f} GB")

def train_fn(config):
    x = config["x"]; y = config["y"]; z = config["z"]
    tune.report({"score": x**2 + y**2 + (ord(z) - 97)})

def main():
    args = parse_args()
    process = psutil.Process(os.getpid())

    tune.run(
        train_fn,
        config={"x": tune.uniform(-10, 10), "y": tune.uniform(-5, 5), "z": tune.choice(["a","b","c"])},
        metric="score", mode="min",
        num_samples=args.trials,
        search_alg=BasicVariantGenerator(),
        resources_per_trial={"cpu": 1}, verbose=1,
        callbacks=[RssReporter(args.report_interval, process)],
    )
    print(f"Final RSS: {process.memory_info().rss / 1e9:.3f} GB")
    ray.shutdown()

if __name__ == "__main__":
    main()

Observed behavior

$ python test_ray_tune.py --trials 4000 --report-interval 100 |grep RSS
   trial     RSS_GB
trial#100: RSS=0.672 GB
trial#200: RSS=0.680 GB
trial#300: RSS=0.686 GB
trial#400: RSS=0.692 GB
trial#500: RSS=0.702 GB
trial#600: RSS=0.704 GB
....
trial#1000: RSS=1.084 GB

RSS grows monotonically with trial count. Each batch of 100 completed trials adds a roughly constant amount of memory, suggesting per-trial state is retained and never released.

Expected behavior

For lightweight trials that each report a single scalar, the driver process RSS should remain relatively stable after initial ramp-up, not grow linearly to the point of OOM.

Impact

For workloads with thousands of trials (hyperparameter sweeps, NAS, etc.), the driver process can OOM before all trials complete. This blocks production hyperparameter search runs.

Versions / Dependencies

  • Ray version: 2.55.1
  • Python: 3.10.13
  • OS: Debian GNU/Linux 12 (bookworm)

Reproduction script

"""
python test_ray_tune.py [--trials 4000] [--report-interval 100]
"""
import argparse, os, psutil, ray
from ray import tune
from ray.tune import Callback
from ray.tune.search.basic_variant import BasicVariantGenerator

def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--trials", type=int, default=4000)
    p.add_argument("--report-interval", type=int, default=100)
    return p.parse_args()

class RssReporter(Callback):
    def __init__(self, interval, process=None):
        self.interval = interval
        self.process = process
        self.count = 0
    def on_trial_complete(self, iteration, trials, trial, **info):
        self.count += 1
        if self.count % self.interval: return
        rss = self.process.memory_info().rss if self.process else 0
        print(f"trial#{self.count}: RSS={rss / 1e9:.3f} GB")

def train_fn(config):
    x = config["x"]; y = config["y"]; z = config["z"]
    tune.report({"score": x**2 + y**2 + (ord(z) - 97)})

def main():
    args = parse_args()
    process = psutil.Process(os.getpid())

    tune.run(
        train_fn,
        config={"x": tune.uniform(-10, 10), "y": tune.uniform(-5, 5), "z": tune.choice(["a","b","c"])},
        metric="score", mode="min",
        num_samples=args.trials,
        search_alg=BasicVariantGenerator(),
        resources_per_trial={"cpu": 1}, verbose=1,
        callbacks=[RssReporter(args.report_interval, process)],
    )
    print(f"Final RSS: {process.memory_info().rss / 1e9:.3f} GB")
    ray.shutdown()

if __name__ == "__main__":
    main()

Issue Severity

High: It blocks me from completing my task.

This is a known issue: in Ray Tune (e.g., v2.55.1), the driver process’s RSS grows linearly with the number of trials because per-trial state (Trial objects, metadata, and experiment checkpoints) is retained in memory and not released, even after trials complete. This is partly by design—Tune keeps all Trial objects and their metadata for result reporting and checkpointing, which inherently scales with the number of trials and can lead to OOM for large sweeps. Some memory leaks in loggers and trial decision queues have been fixed, but the fundamental linear growth remains unless the architecture is changed to spill completed trial state to disk instead of memory. See the detailed discussion and proposed fixes in Ray GitHub Issue #64231 and PR #64335.

If you do not need to keep all trial results in memory (e.g., you do not use ExperimentAnalysis/ResultGrid), there is currently no built-in way to disable this retention. Workarounds like manually clearing trial metadata or using callbacks to clean up state have only limited effect, as the controller and checkpointing logic still hold references. For truly flat memory usage, a deeper architectural change is needed, which is not yet implemented. Would you like a step-by-step breakdown of the memory retention points and possible mitigations?

Sources:

Hint: Mention @Herald in the post for followups.

@RunLLM When running on multi node cluster, the leak could be 0.1MB per trial, I highly doubt metadata and json could contribute such a huge leak.

xxxxx Who is @Herald BTW?