My lightgbm_ray training crashes when I pass the extra argument early_stopping_rounds=10. Here is a code example. The following version runs smoothly:
from lightgbm_ray import RayDMatrix, RayParams, RayFileType
from lightgbm_ray import train as Train
from lightgbm_ray import predict as Predict

NUM_ACTORS = 4
CPUS_PER_ACTOR = 2

dtrain = RayDMatrix(
    train_path,
    label="y",
    columns=COLUMNS,
    filetype=RayFileType.PARQUET,
    distributed=True)

dval = ...
dtest = ...
lgb_params = ...
evals_result = {}

bst = Train(
    lgb_params,
    dtrain,
    num_boost_round=300,
    evals_result=evals_result,
    valid_sets=[dtrain, dval, dtest],
    valid_names=["train", "validation", "test"],
    verbose_eval=True,
    ray_params=RayParams(num_actors=NUM_ACTORS, cpus_per_actor=CPUS_PER_ACTOR),
    # early_stopping_rounds=10,  # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
)
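For context, my understanding is that early_stopping_rounds=10 should simply stop boosting once the validation metric has not improved for 10 consecutive rounds. A plain-Python sketch of that logic (illustrative names only, no lightgbm involved):

```python
def early_stopping_round(metrics, rounds=10):
    """Return the 1-based boosting round at which training would stop,
    or None if the metric is still improving (lower is better, e.g. l2)."""
    best = float("inf")
    best_round = 0
    for i, m in enumerate(metrics, start=1):
        if m < best:
            best, best_round = m, i   # new best: reset the patience counter
        elif i - best_round >= rounds:
            return i                  # no improvement for `rounds` rounds
    return None
```

So nothing about the argument itself looks exotic; the crash only appears when it is passed to the distributed Train.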
But it crashes with the following modification:
bst = Train(
    lgb_params,
    dtrain,
    num_boost_round=300,
    evals_result=evals_result,
    valid_sets=[dtrain, dval, dtest],
    valid_names=["train", "validation", "test"],
    verbose_eval=True,
    ray_params=RayParams(num_actors=NUM_ACTORS, cpus_per_actor=CPUS_PER_ACTOR),
    early_stopping_rounds=10,  # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
)
The error:
...
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [19] train's l2: 4193.85 validation's l2: 14344 test's l2: 2587.98
(_RemoteRayLightGBMActor pid=None) [19] train's l2: 3460.62 validation's l2: 11457.6 test's l2: 2100.86
(_RemoteRayLightGBMActor pid=None) [20] train's l2: 3422.48 validation's l2: 11414.8 test's l2: 2072.84
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [19] train's l2: 4109.12 validation's l2: 17701.9 test's l2: 2293.46
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [20] train's l2: 4056.87 validation's l2: 17642 test's l2: 2437.54
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [20] train's l2: 3485.36 validation's l2: 9115.94 test's l2: 2114.12
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [20] train's l2: 4138.02 validation's l2: 14287.5 test's l2: 2723.57
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [21] train's l2: 4089.48 validation's l2: 14236 test's l2: 2747.22
(_RemoteRayLightGBMActor pid=None) [21] train's l2: 3388.29 validation's l2: 11372.9 test's l2: 2076.71
(_RemoteRayLightGBMActor pid=None) [LightGBM] [Info] Finished linking network in 1.266743 seconds
(pid=None) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [21] train's l2: 4016.31 validation's l2: 17601.8 test's l2: 2453.85
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.253) [LightGBM] [Info] Finished linking network in 1.776473 seconds
(pid=None, ip=172.30.241.253) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
2021-11-09 04:26:27,949 INFO elastic.py:156 -- Actor status: 4 alive, 0 dead (4 total)
Traceback (most recent call last):
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 767, in _train
    ray.get(ready)
  File "/home/ray/.local/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/.local/lib/python3.7/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayXGBoostTrainingError): ray::_RemoteRayLightGBMActor.train() (pid=5078, ip=172.30.11.206, repr=<lightgbm_ray.main._RemoteRayLightGBMActor object at 0x7f6bdc661250>)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/sklearn.py", line 758, in fit
    callbacks=callbacks
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/engine.py", line 293, in train
    booster.update(fobj=fobj)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/basic.py", line 3023, in update
    ctypes.byref(is_finished)))
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm/basic.py", line 125, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: Socket recv error, Connection reset by peer (code: 104)

The above exception was the direct cause of the following exception:

ray::_RemoteRayLightGBMActor.train() (pid=5078, ip=172.30.11.206, repr=<lightgbm_ray.main._RemoteRayLightGBMActor object at 0x7f6bdc661250>)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 429, in train
    raise RayXGBoostTrainingError("Training failed.") from raise_from
xgboost_ray.main.RayXGBoostTrainingError: Training failed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 1191, in train
    **kwargs)
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 786, in _train
    raise RayActorError from exc
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "algorithms.py", line 98, in <module>
    early_stopping_rounds=10,  # THIS MAKES IT CRASH, ISSUE POSTED IN RAY FORUM
  File "/home/ray/.local/lib/python3.7/site-packages/lightgbm_ray/main.py", line 1263, in train
    ) from exc
RuntimeError: A Ray actor died during training and the maximum number of retries (0) is exhausted.
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [21] train's l2: 3452.57 validation's l2: 9074.71 test's l2: 2113.81
(_RemoteRayLightGBMActor pid=None, ip=172.30.11.231) [LightGBM] [Info] Finished linking network in 1.939171 seconds
(pid=None, ip=172.30.11.231) [LightGBM] [Fatal] Socket recv error, Connection reset by peer (code: 104)
(_RemoteRayLightGBMActor pid=None, ip=172.30.241.250) [LightGBM] [Info] Finished linking network in 1.824866 seconds
I am running it in Ray cluster mode with address='auto' on a Kubernetes cluster. Am I doing something wrong on my side? Please help.
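For completeness, the driver side just connects to the already-running cluster (the cluster itself is brought up by the Kubernetes operator), so nothing unusual there:

```python
import ray

# Connect to the existing Ray cluster started on the Kubernetes nodes
ray.init(address="auto")
```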