Increasing amount of Backup_poller errors

andyjh122 · August 15, 2022, 7:26pm

Hi, I’m running Optuna with Ray 1.13 and grpc 1.43.0, and I’m getting an increasing number of backup_poller errors that eventually hang my trials.

Also it seems like over time training slows down significantly.

I’ve looked through the Ray issues and didn’t see anything specifically related. Has anyone else seen this?

Thank you!
Andrew

E0815 14:26:06.967181484 1842238 backup_poller.cc:134] Run client channel backup poller: {"created":"@1660587966.967090153","description":"pollset_work","file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":320,"referenced_errors":[{"created":"@1660587966.967085911","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/ev_epollex_linux.cc","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

jjyao · August 17, 2022, 12:13am

Do you have a repro script?

andyjh122 · August 18, 2022, 6:08pm

Thank you jjyao! I’ll put a repro script together.

I managed to trace this: it is specific to agent train.trainer.

The trainer spins up a TensorboardX process, ray util, and a QueueFeederThread, but ray.shutdown() after a training run does not seem to shut them down; they stay alive until the main process ends. We run a training trial loop with Optuna, and each trial calls ray.init() and ray.shutdown(), so none of the three is closed on shutdown. When the trainer is called again, a new set of three threads is spawned, is never shut down either, and they pile up like a fork bomb.
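
One way to confirm the leak is to compare the live threads and child processes before and after each trial. The sketch below is only an illustration: `live_workers`, `run_trials`, and `trial_fn` are hypothetical names, and `trial_fn` stands in for whatever one trial does (here, presumably ray.init(), the training call, and ray.shutdown()). It uses only the standard library, so it only sees Python threads in the main process and child processes started through multiprocessing.

```python
import multiprocessing
import threading


def live_workers():
    """Return the set of live Thread and Process objects owned by this process."""
    threads = set(threading.enumerate())
    children = set(multiprocessing.active_children())
    return threads | children


def run_trials(trial_fn, n_trials):
    """Run trial_fn(i) n_trials times and record what each trial leaves behind.

    trial_fn is a hypothetical placeholder for one complete trial,
    including any setup and shutdown calls it makes.
    """
    leaked_per_trial = []
    for i in range(n_trials):
        before = live_workers()
        trial_fn(i)
        # Anything that is new and still alive after the trial returned
        # was not cleaned up by the trial's own shutdown step.
        leaked = sorted(w.name for w in live_workers() - before if w.is_alive())
        leaked_per_trial.append(leaked)
        print(f"trial {i}: {len(leaked)} leftover worker(s): {leaked}")
    return leaked_per_trial
```

If the same names keep showing up in the list for every trial, those workers are surviving the shutdown call and accumulating across trials.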

Andrew

jjyao · August 19, 2022, 7:33am

@xwjiang2010 this seems to be an ML question.

xwjiang2010 · August 19, 2022, 5:09pm

yeah a repro script would be helpful. happy to take a look then!

andyjh122 · August 23, 2022, 8:06pm

Working on getting a repro script together.

We were able to fix the two leaking threads by:

1. Uninstalling tensorboardX
2. Setting "log_sys_usage" to False

ray/utils.py was the offending code, at least for #2; it ignores ray.shutdown().

Maybe it's something with psutil?
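
For anyone hitting the same thing, here is a minimal sketch of the "log_sys_usage" change, assuming the trainer takes a plain config dict. The helper name `without_sys_usage_logging` and the other keys in the example are hypothetical; only the "log_sys_usage" key comes from our setup.

```python
import copy


def without_sys_usage_logging(base_config):
    """Return a copy of a trainer config dict with system-usage logging off.

    The input dict is left untouched so the same base config can be
    reused across trials.
    """
    config = copy.deepcopy(base_config)
    config["log_sys_usage"] = False
    return config


# Hypothetical base config; the keys other than "log_sys_usage" are placeholders.
base_config = {"framework": "torch", "num_workers": 1}
trial_config = without_sys_usage_logging(base_config)
```

The resulting dict would then be passed as the trainer config at the start of each trial. The tensorboardX fix needs no code change; we just removed the package from the environment.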
