[Ray Tune] RSS memory leak of the driver process

zhangsikai123 · June 28, 2026, 8:15am

[Tune] Ray Tune memory leak: RSS grows linearly with trial count in tune.run()

opened 05:33AM - 20 Jun 26 UTC

bug tune triage stability community-backlog

Environment

Ray version: 2.55.1
Python: 3.10.13
OS: Debian GNU/Linux 12 (bookworm)

Description

When running a large number of trials via tune.run() in a single process (local mode, resources_per_trial={"cpu": 1}), the driver process RSS grows linearly with the number of completed trials, eventually leading to OOM.

Reproduction

```python
"""
python test_ray_tune.py [--trials 4000] [--report-interval 100]
"""
import argparse
import os

import psutil
import ray
from ray import tune
from ray.tune import Callback
from ray.tune.search.basic_variant import BasicVariantGenerator


def parse_args():
    p = argparse.ArgumentParser()
    p.add_argument("--trials", type=int, default=4000)
    p.add_argument("--report-interval", type=int, default=100)
    return p.parse_args()


class RssReporter(Callback):
    """Print the RSS of `process` every `interval` completed trials."""

    def __init__(self, interval, process=None):
        self.interval = interval
        self.process = process
        self.count = 0

    def on_trial_complete(self, iteration, trials, trial, **info):
        self.count += 1
        # Only report on every `interval`-th completed trial.
        if self.count % self.interval:
            return
        rss = self.process.memory_info().rss if self.process else 0
        print(f"trial#{self.count}: RSS={rss / 1e9:.3f} GB")


def train_fn(config):
    # Trivial objective: each trial reports a single scalar and finishes.
    x, y, z = config["x"], config["y"], config["z"]
    tune.report({"score": x**2 + y**2 + (ord(z) - 97)})


def main():
    args = parse_args()
    # Handle to the driver process, whose RSS is being tracked.
    process = psutil.Process(os.getpid())

    tune.run(
        train_fn,
        config={
            "x": tune.uniform(-10, 10),
            "y": tune.uniform(-5, 5),
            "z": tune.choice(["a", "b", "c"]),
        },
        metric="score",
        mode="min",
        num_samples=args.trials,
        search_alg=BasicVariantGenerator(),
        resources_per_trial={"cpu": 1},
        verbose=1,
        callbacks=[RssReporter(args.report_interval, process)],
    )
    print(f"Final RSS: {process.memory_info().rss / 1e9:.3f} GB")
    ray.shutdown()


if __name__ == "__main__":
    main()
```
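
To narrow down where the growth comes from, one option is to diff `tracemalloc` snapshots between reports. The block below is a minimal sketch, not part of the original script: `AllocationDiff` is a hypothetical helper name, and the `bytearray` list only simulates retained state. `tracemalloc` sees only allocations made through Python's allocator, so growth inside native code would not show up here.

```python
import tracemalloc


class AllocationDiff:
    """Report which source lines gained the most traced memory since the last call."""

    def __init__(self, top=5, frames=1):
        self.top = top
        if not tracemalloc.is_tracing():
            tracemalloc.start(frames)
        self._previous = tracemalloc.take_snapshot()

    def report(self):
        current = tracemalloc.take_snapshot()
        # compare_to() returns StatisticDiff entries, largest change first.
        stats = current.compare_to(self._previous, "lineno")
        self._previous = current
        return [(str(s.traceback[0]), s.size_diff) for s in stats[: self.top]]


differ = AllocationDiff(top=3)
retained = [bytearray(1024) for _ in range(1000)]  # simulated state that is never freed
rows = differ.report()
for location, grown in rows:
    print(f"{grown / 1e3:10.1f} kB  {location}")
```

In the reproduction, `differ.report()` could be called from `RssReporter.on_trial_complete` on the same interval as the RSS print, since Tune callbacks run in the driver process.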

Observed behavior

```
$ python test_ray_tune.py --trials 4000 --report-interval 100 | grep RSS
   trial     RSS_GB
trial#100: RSS=0.672 GB
trial#200: RSS=0.680 GB
trial#300: RSS=0.686 GB
trial#400: RSS=0.692 GB
trial#500: RSS=0.702 GB
trial#600: RSS=0.704 GB
....
trial#1000: RSS=1.084 GB
```

RSS grows monotonically with trial count. Each batch of 100 completed trials adds a roughly constant amount of memory, suggesting per-trial state is retained and never released.
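
The per-trial growth can be worked out directly from the readings listed above. This is a small sketch of that arithmetic; the function name is illustrative, and the readings between trial 600 and trial 1000 are elided in the output, so the last interval is only an average.

```python
# (completed trials, RSS in GB) copied from the observed output.
readings = [(100, 0.672), (200, 0.680), (300, 0.686), (400, 0.692),
            (500, 0.702), (600, 0.704), (1000, 1.084)]


def growth_per_trial_kb(samples):
    """Return (start_trial, end_trial, kB gained per trial) for each adjacent pair."""
    rates = []
    for (t0, r0), (t1, r1) in zip(samples, samples[1:]):
        # 1 GB = 1e6 kB, matching the 1e9 divisor used by the script.
        rates.append((t0, t1, (r1 - r0) * 1e6 / (t1 - t0)))
    return rates


rates = growth_per_trial_kb(readings)
for t0, t1, rate in rates:
    print(f"trials {t0:>4}-{t1:<4}: {rate:7.1f} kB/trial")
```

On these numbers the first six readings grow by roughly 20 to 100 kB per trial, while the span from trial 600 to trial 1000 averages about 950 kB per trial, so the full set of readings would be needed to tell a steady slope from a step change.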

Expected behavior

For lightweight trials that each report a single scalar, the driver process RSS should remain relatively stable after initial ramp-up, not grow linearly to the point of OOM.

Impact

For workloads with thousands of trials (hyperparameter sweeps, NAS, etc.), the driver process can OOM before all trials complete. This blocks production hyperparameter search runs.

Issue Severity

High: It blocks me from completing my task.

RunLLM · June 28, 2026, 8:16am

This is a known issue: in Ray Tune (e.g., v2.55.1), the driver process’s RSS grows linearly with the number of trials because per-trial state (Trial objects, metadata, and experiment checkpoints) is retained in memory and not released, even after trials complete. This is partly by design—Tune keeps all Trial objects and their metadata for result reporting and checkpointing, which inherently scales with the number of trials and can lead to OOM for large sweeps. Some memory leaks in loggers and trial decision queues have been fixed, but the fundamental linear growth remains unless the architecture is changed to spill completed trial state to disk instead of memory. See the detailed discussion and proposed fixes in Ray GitHub Issue #64231 and PR #64335.
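
One way to check the claim that completed trial objects stay alive is to count live instances by class name in the driver. The sketch below uses only the standard library; `live_object_counts` is a hypothetical helper, and the `Trial` class here is a stand-in defined for illustration, not Ray's class. It assumes the objects of interest are tracked by the garbage collector, which holds for ordinary class instances.

```python
import gc
from collections import Counter


def live_object_counts(type_names):
    """Count live, GC-tracked objects whose class name is in type_names."""
    wanted = set(type_names)
    gc.collect()  # drop unreachable cycles first so only retained objects are counted
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return {name: counts.get(name, 0) for name in wanted}


class Trial:  # stand-in for illustration only
    def __init__(self, trial_id):
        self.trial_id = trial_id


before = live_object_counts(["Trial"])["Trial"]
finished = [Trial(i) for i in range(250)]  # references kept, so nothing is freed
after = live_object_counts(["Trial"])["Trial"]
print(f"Trial instances: {before} -> {after}")
```

Calling the helper from a Tune callback at a fixed interval would show whether the instance count tracks the number of completed trials.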

If you do not need to keep all trial results in memory (e.g., you do not use ExperimentAnalysis/ResultGrid), there is currently no built-in way to disable this retention. Workarounds like manually clearing trial metadata or using callbacks to clean up state have only limited effect, as the controller and checkpointing logic still hold references. For truly flat memory usage, a deeper architectural change is needed, which is not yet implemented. Would you like a step-by-step breakdown of the memory retention points and possible mitigations?

Hint: Mention @Herald in the post for followups.

zhangsikai123 · June 28, 2026, 8:22am

@RunLLM On a multi-node cluster the leak can reach about 0.1 MB per trial. I highly doubt that metadata and JSON alone could account for a leak that large.
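
For a rough sense of scale, the serialized size of one trial's config and result can be compared with the 0.1 MB figure. This is a sketch under stated assumptions: the `record` layout and the `trial_id` value are hypothetical, and a real Tune result carries additional bookkeeping fields, so the payload size here is a lower bound.

```python
import json

# Hypothetical per-trial record shaped like the reproduction's config and result.
x, y, z = -3.141592653589793, 2.718281828459045, "b"
record = {
    "trial_id": "example_00042",  # illustrative value only
    "config": {"x": x, "y": y, "z": z},
    "result": {"score": x**2 + y**2 + (ord(z) - 97)},
}

payload_bytes = len(json.dumps(record).encode("utf-8"))
reported_leak_bytes = 0.1 * 1e6  # 0.1 MB per trial, as reported above
ratio = reported_leak_bytes / payload_bytes
print(f"JSON payload: {payload_bytes} B, reported leak is about {ratio:.0f}x larger")
```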

zhangsikai123 · June 28, 2026, 8:55am

Who is @Herald, BTW?
