Progressive Slowdown and Deadlock in Ray Remote Tasks During Black-Box Optimization

How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.

Description

I am encountering a critical issue when using @ray.remote tasks to run black-box optimization campaigns in parallel. Each campaign is set up as a remote task. The following pseudocode outlines the workflow within each remote task:

1. Instantiate alpha, a BoTorch model with an acquisition function (see https://botorch.org/tutorials/closed_loop_botorch_only).
2. Instantiate the black box f(x) (input is a tensor with shape (3, 5), output is a tensor with shape (3, 1)).
3. X, Y = [], []  # Initialize empty datasets; X holds the inputs of f(x), Y the outputs.
4. For i in range(budget):
   a. alpha.train(X, Y)  # Fit alpha to X and Y
   b. x = botorch.optimize_acqf(alpha)  # Optimize the acquisition function to get a new recommended x
   c. y = f(x)  # Evaluate the black box f(x)
   d. X.append(x); Y.append(y)  # Augment the dataset
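For reference, here is a minimal, self-contained sketch of what one such remote task might look like. It assumes a recent BoTorch version (fit_gpytorch_mll, qExpectedImprovement) and uses a placeholder quadratic objective in place of my real black box; the function name run_campaign, the bounds, and the acquisition settings are illustrative, not my actual code:

```python
import torch
import ray
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import qExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood


@ray.remote(num_cpus=2)
def run_campaign(budget: int, seed: int):
    torch.manual_seed(seed)
    d, q = 5, 3  # 5 input dimensions, 3 candidate points per iteration
    bounds = torch.stack([torch.zeros(d, dtype=torch.double),
                          torch.ones(d, dtype=torch.double)])

    def black_box(x: torch.Tensor) -> torch.Tensor:
        # Placeholder objective standing in for the real black box:
        # negative squared distance from 0.5 in each dimension.
        return -((x - 0.5) ** 2).sum(dim=-1, keepdim=True)

    # Initial random design: X has shape (3, 5), Y has shape (3, 1).
    X = torch.rand(q, d, dtype=torch.double)
    Y = black_box(X)

    for _ in range(budget):
        model = SingleTaskGP(X, Y)                          # "alpha"
        mll = ExactMarginalLogLikelihood(model.likelihood, model)
        fit_gpytorch_mll(mll)                               # alpha.train(X, Y)
        acqf = qExpectedImprovement(model, best_f=Y.max())
        x, _ = optimize_acqf(                               # the step that hangs in later jobs
            acqf, bounds=bounds, q=q, num_restarts=5, raw_samples=64
        )
        y = black_box(x)
        X, Y = torch.cat([X, x]), torch.cat([Y, y])         # augment the dataset

    return X, Y
```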

Outside the tasks, I use ray.wait() to collect finished tasks and process their results.
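The driver side then looks roughly like this (run_campaign refers to the sketch above; the budget, seeds, and task count are illustrative):

```python
import ray

ray.init()  # or ray.init(address="auto") when attaching to an existing cluster

# One remote task per campaign; each job submits 120 of these.
pending = [run_campaign.remote(budget=30, seed=i) for i in range(120)]

while pending:
    done, pending = ray.wait(pending, num_returns=1)
    X, Y = ray.get(done[0])
    # ... process the finished campaign here ...
```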

Problem
Initially, I can run several jobs, each consisting of 120 tasks, with no issues. Each task runs on CPU only, requests 2 CPUs, and normally completes in roughly 3 minutes. However, as I continue launching more jobs, the tasks become progressively slower, until all tasks in subsequent jobs hang indefinitely (e.g., for over 3 hours).

After some analysis, I identified that the deadlock occurs inside botorch.optimize_acqf during the forward pass, as confirmed by a flamegraph (see below). What is perplexing is that the issue only manifests in later jobs, not in the initial ones. The problem temporarily resolves if I run ray stop -f followed by ray start, which suggests a potential state-accumulation issue.

Hypotheses & Questions:

  • Memory Leak: Could there be a memory leak within Ray or the BoTorch/PyTorch setup?
  • State Accumulation: Could this be related to an internal state accumulation in Ray or PyTorch over time?
  • Garbage Collection: Is there a possibility that this could be resolved by invoking Python’s garbage collector manually? I noticed similarities to issues like PyTorch Issue #95462.

Request:
I would appreciate any guidance or ideas on what to try next. Specifically:

  • Any debugging techniques or tools within Ray that could help further isolate the problem.
  • Suggestions on how to manage or avoid state accumulation within Ray workers.
  • Recommendations on how to better utilize the garbage collector in this context.

Keywords:
Bayesian Optimization, Memory Leak, Deadlock, PyTorch, BoTorch, Garbage Collector, GC, Gaussian Process, Acquisition Function

Thanks in advance for your help!

Update: the issue seems to be fixed by manually calling gc.collect() at the end of each task.
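For anyone hitting the same thing, a sketch of where the call goes; _run_bo_loop is a hypothetical helper standing in for the optimization loop shown earlier:

```python
import gc
import ray


@ray.remote(num_cpus=2)
def run_campaign(budget: int, seed: int):
    try:
        # _run_bo_loop is a hypothetical helper wrapping the BO loop above.
        X, Y = _run_bo_loop(budget, seed)
        return X, Y
    finally:
        # Ray reuses worker processes across tasks, so objects kept alive by
        # reference cycles (GP models, acquisition functions, optimizer state)
        # can accumulate; an explicit collection frees them before the worker
        # picks up its next task.
        gc.collect()
```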