lightgbm_ray train() crashes with additional arguments (early_stopping_rounds)

My lightgbm_ray training crashes when I pass the extra argument early_stopping_rounds=10 to train(). Here is a code example.

The following code runs smoothly.

from lightgbm_ray import RayDMatrix, RayParams, RayFileType
from lightgbm_ray import train as Train
from lightgbm_ray import predict as Predict
NUM_ACTORS = 4
CPUS_PER_ACTOR = 2

dtrain = RayDMatrix(
    train_path,
    label="y",  
    columns=COLUMNS, 
    filetype=RayFileType.PARQUET,
    distributed=True)
dval = ...
dtest = ...

lgb_params = ...

evals_result = {}
bst = Train(
    lgb_params,
    dtrain,
    num_boost_round=300,
    evals_result=evals_result,
    valid_sets=[dtrain, dval, dtest],
    valid_names=["train", "validation", "test"],
    verbose_eval=True,
    ray_params=RayParams(num_actors=NUM_ACTORS, cpus_per_actor=CPUS_PER_ACTOR),
    # early_stopping_rounds=10, # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
    )

But it crashes with the following modification.

bst = Train(
    lgb_params,
    dtrain,
    num_boost_round=300,
    evals_result=evals_result,
    valid_sets=[dtrain, dval, dtest],
    valid_names=["train", "validation", "test"],
    verbose_eval=True,
    ray_params=RayParams(num_actors=NUM_ACTORS, cpus_per_actor=CPUS_PER_ACTOR),
    early_stopping_rounds=10, # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
    )

The error:

...
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [19]      train's l2: 4193.85     validation's l2: 14344  test's l2: 2587.98
(_RemoteRayLightGBMActor pid=None) [19] train's l2: 3460.62     validation's l2: 11457.6        test's l2: 2100.86
(_RemoteRayLightGBMActor pid=None) [20] train's l2: 3422.48     validation's l2: 11414.8        test's l2: 2072.84
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [19]      train's l2: 4109.12     validation's l2: 17701.9        test's l2: 2293.46
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [20]      train's l2: 4056.87     validation's l2: 17642  test's l2: 2437.54
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [20]       train's l2: 3485.36     validation's l2: 9115.94        test's l2: 2114.12
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [20]      train's l2: 4138.02     validation's l2: 14287.5        test's l2: 2723.57
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [21]      train's l2: 4089.48     validation's l2: 14236  test's l2: 2747.22
(_RemoteRayLightGBMActor pid=None) [21] train's l2: 3388.29     validation's l2: 11372.9        test's l2: 2076.71
(_RemoteRayLightGBMActor pid=None) [LightGBM] [Info] Finished linking network in 1.266743 seconds
(pid=None) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [21]      train's l2: 4016.31     validation's l2: 17601.8        test's l2: 2453.85
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [LightGBM] [Info] Finished linking network in 1.776473 seconds
(pid=None, ip=172.30.241.253) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
2021-11-09 04:26:27,949 INFO elastic.py:156 -- Actor status: 4 alive, 0 dead (4 total)
Traceback (most recent call last):
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 767, in _train
    ray.get(ready)
  File "/home/ray/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/.local/lib/python3.7/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayXGBoostTrainingError): ray::_RemoteRayLightGBMActor.train() (pid=5078, ip=172.30.11.206, repr=<lightgbm_ray.main._RemoteRayLightGBMActor object at 0x7f6bdc661250>)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/engine.py", line 293, in train
    booster.update(fobj=fobj)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/basic.py", line 3023, in update
    ctypes.byref(is_finished)))
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)

The above exception was the direct cause of the following exception:

ray::_RemoteRayLightGBMActor.train() (pid=5078, ip=172.30.11.206, repr=<lightgbm_ray.main._RemoteRayLightGBMActor object at 0x7f6bdc661250>)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 429, in train
    raise RayXGBoostTrainingError("Training failed.") from raise_from
xgboost_ray.main.RayXGBoostTrainingError: Training failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 1191, in train
    **kwargs)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 786, in _train
    raise RayActorError from exc
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "algorithms.py", line 98, in <module>
    early_stopping_rounds=10, # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 1263, in train
    ) from exc
RuntimeError: A Ray actor died during training and the maximum number of retries (0) is exhausted.
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [21]       train's l2: 3452.57     validation's l2: 9074.71        test's l2: 2113.81
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [LightGBM] [Info] Finished linking network in 1.939171 seconds
(pid=None, ip=172.30.11.231) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [LightGBM] [Info] Finished linking network in 1.824866 seconds

I am running this in Ray cluster mode with address='auto' on a Kubernetes cluster. Am I doing something wrong on my side? Please help.
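
For context, I connect to the existing cluster with the standard call before building the RayDMatrix (everything else is elided):

import ray

# Connect to the already-running Ray cluster (head node started separately on Kubernetes).
ray.init(address="auto")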

cc @Yard1: does lightgbm-ray work with early_stopping_rounds?

It doesn’t work - an exception is raised for the sklearn API, but it looks like this check slipped past the train API. I will add it.
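
The fix will likely be a guard along these lines in the train path (a rough sketch only, not the actual lightgbm_ray code; the helper name is made up):

def _check_unsupported_kwargs(kwargs):
    # Reject arguments the distributed train path cannot handle yet,
    # mirroring the check that already exists for the sklearn API.
    if kwargs.get("early_stopping_rounds") is not None:
        raise ValueError(
            "early_stopping_rounds is not currently supported by "
            "distributed lightgbm_ray training.")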

@Yard1 @amogkam I found similar behavior with a custom objective function as well, i.e., when passing fobj=custom_loss(...). Thanks.

I cannot replicate that, and we have tests for it. Can you post your objective function? Perhaps there is some issue with it.
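
For reference, a custom objective for the native train API takes (preds, train_data) and returns a (grad, hess) pair; a minimal squared-error sketch (function and variable names are illustrative):

import numpy as np

def custom_loss(preds, train_data):
    # Element-wise gradient and hessian of the squared-error objective.
    labels = train_data.get_label()
    grad = preds - labels
    hess = np.ones_like(preds)
    return grad, hess

# Note it is passed as a callable, e.g. fobj=custom_loss, not called at the call site.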