Ray Tune gets stuck indefinitely

1. Severity of the issue: (select one)
[x] High: Completely blocks me.

2. Environment:

  • Ray version: 2.37.0
  • Python version: 3.12
  • OS: Windows
  • Other libs/tools (if relevant): LightGBM

3. What happened vs. what you expected:

  • Expected: The hyperparameter search for a gradient-boosted regression tree using Ray Tune and LightGBM runs through.
  • Actual: Sometimes I start the training and the first trial starts running, but then nothing happens. I just get the trial status table printed every 30 s with status RUNNING. However, my target function never seems to be called at all: its first line is a print, which never appears on screen. I need to run multiple hyperparameter optimization tasks on slightly different data (different months, but the data structure and distribution are the same). Weirdly, the problem never arises on some of the data, and on the data where it does arise, it happens in about 90% of the runs. My code is below:
import os
import tempfile

import lightgbm as lgb
from sklearn.metrics import mean_squared_error

from ray.air import session  # adjust to your Ray version; newer releases use ray.train.report instead
from ray.train import Checkpoint
from ray.tune.integration.lightgbm import TuneReportCheckpointCallback


def trainGBRT(config, X_train, X_val, y_train, y_val):
    try:
        # Build the LightGBM datasets
        print("StartTrain", flush=True)
        train_data = lgb.Dataset(X_train, label=y_train, params={'max_bin': 63})
        val_data = lgb.Dataset(X_val, label=y_val, params={'max_bin': 63}, reference=train_data)
        print("Dataset", flush=True)
        # Train the model, reporting the eval l2 to Tune after every iteration
        model = lgb.train(
            config,
            train_data,
            valid_sets=[val_data],
            valid_names=['eval'],
            callbacks=[
                lgb.early_stopping(stopping_rounds=25),
                TuneReportCheckpointCallback(
                    {"l2": "eval-l2"},
                    frequency=1,
                    checkpoint_at_end=True,
                ),
            ],
        )
        print("model", flush=True)
        # Make predictions on the validation set
        y_pred = model.predict(X_val)
        print("pred", flush=True)
        # Evaluate the model
        mse = mean_squared_error(y_val, y_pred)
        print("mse", flush=True)
        # Save the final model and report it as a checkpoint
        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            path = os.path.join(temp_checkpoint_dir, "model.txt")
            model.save_model(path)
            checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)
            session.report({"l2": mse, "done": True}, checkpoint=checkpoint)
    except Exception as e:
        import traceback
        print("Exception occurred:", e)
        traceback.print_exc()
        raise

and

import ray
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.optuna import OptunaSearch

config = {
    'objective': 'regression',
    'metric': 'l2',
    'boosting_type': 'gbdt',
    'num_iterations': 500,
    'num_leaves': tune.randint(10, 250),
    'max_depth': tune.randint(3, 10),
    'learning_rate': tune.choice([0.01, 0.1, 1]),
    'feature_fraction': tune.uniform(0.25, 1),
    'bagging_fraction': tune.uniform(0.25, 1),
    'bagging_freq': tune.choice([1, 5, 10]),
    'feature_penalty': list(feature_penalties),
    'lambda_l1': tune.uniform(0, 0.1),
    'lambda_l2': tune.uniform(0, 0.1),
    'num_threads': 8,
    'device_type': 'gpu',
    'verbose': 2,
}
print("Tune")

tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(trainGBRT, X_train=X_train, X_val=X_val, y_train=y_train, y_val=y_val),
        resources={"cpu": 8, "gpu": 1},
    ),
    tune_config=tune.TuneConfig(
        metric="l2",
        mode="min",
        scheduler=ASHAScheduler(max_t=config['num_iterations']),
        search_alg=OptunaSearch(),
        num_samples=50,
        max_concurrent_trials=1,
    ),
    param_space=config,
)
print(ray.cluster_resources())

print("Tune Fit")
results = tuner.fit()

Hi uiag, welcome to the Ray community!

So, you don’t see any sort of errors at all? It just hangs there infinitely?
Is there any output at all from where you’re running Ray from (the terminal)?

Yes, it hangs there indefinitely. I get the following output:

Trial trainGBRT_7d75e5da started with configuration:
╭──────────────────────────────────────────────────────────╮
│ Trial trainGBRT_7d75e5da config                          │
├──────────────────────────────────────────────────────────┤
│ bagging_fraction                                 0.35029 │
│ bagging_freq                                          10 │
│ boosting_type                                       gbdt │
│ device_type                                          gpu │
│ feature_fraction                                 0.50953 │
│ feature_penalty                     ...0, 1.0, 1.0, 1.0] │
│ lambda_l1                                        0.07351 │
│ lambda_l2                                         0.0819 │
│ learning_rate                                       0.01 │
│ max_depth                                              3 │
│ metric                                                l2 │
│ num_iterations                                       500 │
│ num_leaves                                           213 │
│ num_threads                                            8 │
│ objective                                     regression │
│ verbose                                                2 │
╰──────────────────────────────────────────────────────────╯

Trial status: 1 RUNNING
Current time: 2025-05-05 21:34:13. Total running time: 30s
Logical resource usage: 8.0/8 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:G)
╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name           status       num_leaves     max_depth     learning_rate     feature_fraction     bagging_fraction     bagging_freq     lambda_l1     lambda_l2 │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ trainGBRT_7d75e5da   RUNNING             213             3              0.01             0.509525              0.35029               10      0.073515     0.0819017 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

The trial status block with RUNNING is then printed every 30 s. I already let it run for multiple hours and nothing changed.
When training works and continues instead, I see the following prints after a couple of seconds:

(trainGBRT pid=26180) StartTrain
(trainGBRT pid=26180) Dataset

(trainGBRT pid=26180) [LightGBM] [Info] This is the GPU trainer!!
(trainGBRT pid=26180) [LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.657855
(trainGBRT pid=26180) [LightGBM] [Info] Total Bins 18143
(trainGBRT pid=26180) [LightGBM] [Info] Number of data points in the train set: 7725929, number of used features: 291
(trainGBRT pid=26180) [LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 3070, Vendor: NVIDIA Corporation
(trainGBRT pid=26180) [LightGBM] [Info] Compiling OpenCL Kernel with 64 bins...
(trainGBRT pid=26180) [LightGBM] [Info] GPU programs have been built
(trainGBRT pid=26180) [LightGBM] [Info] Size of histogram bin entry: 8
(trainGBRT pid=26180) [LightGBM] [Info] 288 dense feature groups (2121.99 MB) transferred to GPU in 0.778445 secs. 1 sparse feature groups
(trainGBRT pid=26180) [LightGBM] [Info] Start training from score 0.000004
(trainGBRT pid=26180) [LightGBM] [Debug] Re-bagging, using 2708706 data to train
(trainGBRT pid=26180) [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
(trainGBRT pid=26180) [LightGBM] [Debug] Trained a tree with leaves = 8 and depth = 3

So when it hangs, not even the “StartTrain” print in the trainGBRT function appears. I also tried using only the CPU for training, but this didn’t help.

Hi! Thanks for sending that to me! So, Ray marks a trial RUNNING as soon as its worker process has been scheduled, so the RUNNING status alone doesn’t mean your function is executing yet. The successful run you pasted does look healthy, which is good.

This caught my eye tho: Number of data points in the train set: 7725929, number of used features: 291

With ~7.7 million rows × 291 columns, that single NumPy array is far larger than the default per‑node object store on a typical workstation. On top of that, passing X_train, X_val, … by value inside tune.with_parameters forces Ray to make another full copy for every trial. Until those objects have been pulled from the object store, the first Python line in trainGBRT() won’t execute, so your print("StartTrain") never shows up. With such a big array, the store can fill up before the worker pulls the data, so the worker process is alive but blocked, and Tune just re‑prints the RUNNING table every 30 s. (This is what I think is happening, at least.)
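
Back-of-the-envelope (assuming float64, NumPy’s default): 7,725,929 rows × 291 columns × 8 bytes ≈ 18 GB for X_train alone, well above what Ray reserves for the object store by default on a single node (roughly 30% of system RAM).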

This might also explain why it works on the smaller months. Here are some docs where they discuss this:

Can you try using ray.put() to see if that helps with this issue? It passes the data by reference instead of making a full copy for every trial (rough sketch below). This is Ray’s recommended pattern for “large constant data”, mentioned in the docs here:

Also, does your code work if you use a much smaller array?
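
Roughly what I mean, as a sketch (it reuses the variable names from your snippet, so treat it as untested pseudocode for your setup):

import ray
from ray import tune

# Put the big arrays into the object store once; afterwards only tiny
# ObjectRefs are shipped around instead of full copies of the data.
X_train_ref = ray.put(X_train)
X_val_ref = ray.put(X_val)
y_train_ref = ray.put(y_train)
y_val_ref = ray.put(y_val)

def trainGBRT(config, X_train_ref, X_val_ref, y_train_ref, y_val_ref):
    # Resolve the references inside the worker process; this reads the
    # shared copy from the object store.
    X_train = ray.get(X_train_ref)
    X_val = ray.get(X_val_ref)
    y_train = ray.get(y_train_ref)
    y_val = ray.get(y_val_ref)
    ...  # rest of your training code stays the same

tuner = tune.Tuner(
    tune.with_resources(
        tune.with_parameters(
            trainGBRT,
            X_train_ref=X_train_ref,
            X_val_ref=X_val_ref,
            y_train_ref=y_train_ref,
            y_val_ref=y_val_ref,
        ),
        resources={"cpu": 8, "gpu": 1},
    ),
    # tune_config and param_space as in your original script
)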

Thank you, that seems reasonable. Sadly using ray.put didn’t help either.

I don’t think this is the problem, as it also works with a lot more data than the 7.7 million × 291 array. To be precise, I use monthly data: either 3, 4, 5, …, or 10 months as training data. It currently doesn’t work with 5 or 6 months, but 10 months (which is a lot more data and also includes the data used for 6 months) works perfectly fine.

:thinking: let me do some more digging and see if there’s anything I can do. I guess it would be difficult to ask for a reproduction script due to the size of the data - is there any way you can repro with a much smaller dataset that I can run locally?

Can you try debugging a bit with these steps and let me know what you see?

While Ray Tune is stuck, can you run ray stack in another shell and let me know what it prints out? It takes a stack dump of all the Python workers on your machine. Running ray memory might help too (or ray memory --stats-only).

ray stack:

2025-05-05 23:45:37,281 - INFO - NumExpr defaulting to 8 threads.

ray memory and ray memory --stats-only print the same output and then run into the following error:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.DEADLINE_EXCEEDED
        details = "Deadline Exceeded"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Deadline Exceeded", grpc_status:4, created_time:"2025-05-05T21:49:23.4907466+00:00"}"
>

ray timeline additionally prints:

2025-05-05 23:50:41,291 INFO scripts.py:1893 -- Connecting to Ray instance at 127.0.0.1:61490.
2025-05-05 23:50:41,291 INFO worker.py:1601 -- Connecting to existing Ray cluster at address: 127.0.0.1:61490...
2025-05-05 23:50:41,303 INFO worker.py:1777 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 

ray status gives:

2025-05-05 23:55:05,659 - INFO - NumExpr defaulting to 8 threads.
======== Autoscaler status: 2025-05-05 23:55:03.468997 ========
Node status
---------------------------------------------------------------
Active:
 1 node_f7df8c1e2359c66c5c70c2559a69fbecc413f48deb44ea59931a7933
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 8.0/8.0 CPU (8.0 used of 8.0 reserved in placement groups)
 1.0/1.0 GPU (1.0 used of 1.0 reserved in placement groups)
 0B/18.62GiB memory
 4.17GiB/9.31GiB object_store_memory

Demands:
 (no resource demands)

ray debug shows the same as ray timeline.
Furthermore, I can’t reach the dashboard at 127.0.0.1:8265.

Regarding your earlier message: I could share my whole script with you (it isn’t much more than the two snippets in my first post), but I won’t be able to share any of the data, as I’m not allowed to.