Increasing number of backup_poller errors

Hi, I’m running optuna with Ray 1.13 and grpc 1.43.0, and I’m getting an increasing number of backup_poller errors that eventually hang my trials.

Also it seems like over time training slows down significantly.

I’ve looked through the Ray issues and didn’t see anything specifically related. Has anyone seen this?

Thank you!

E0815 14:26:06.967181484 1842238] Run client channel backup poller: {"created":"@1660587966.967090153","description":"pollset_work","file":"src/core/lib/iomgr/","file_line":320,"referenced_errors":[{"created":"@1660587966.967085911","description":"Bad file descriptor","errno":9,"file":"src/core/lib/iomgr/","file_line":950,"os_error":"Bad file descriptor","syscall":"epoll_wait"}]}

Do you have a repro script?

Thank you jjyao! I’ll put a repro script together.

I managed to trace this to agent train.trainer specifically.

The trainer spins up a TensorboardX process, ray util, and a QueueFeederThread, but ray.shutdown() after a train does not seem to shut them down until the main process ends. We run a training trial loop with optuna, and each trial loop has its own ray.init and ray.shutdown, so these three never close on shutdown. When the trainer is called again, a new set of three is spawned and is not shut down either, so they pile up like a fork bomb.
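One way to confirm this kind of leak is to compare the live threads in the driver before and after each trial. The sketch below is only an illustration: it assumes the leaked workers show up as Python threads in the driver process (child processes would need a separate check), and `run_trial_with_leak_check` and `trial_fn` are hypothetical names, with `trial_fn` standing in for one optuna trial that calls ray.init, trains, and calls ray.shutdown.

```python
import threading


def thread_names():
    """Return the names of all threads currently alive in this process."""
    return sorted(t.name for t in threading.enumerate())


def leaked_threads(before, after):
    """Return names seen after a trial that were not there before it."""
    remaining = list(before)
    leaked = []
    for name in after:
        if name in remaining:
            # Thread already existed before the trial; not a leak.
            remaining.remove(name)
        else:
            leaked.append(name)
    return leaked


def run_trial_with_leak_check(trial_fn):
    """Run one trial (hypothetical callable) and report threads it left behind."""
    before = thread_names()
    trial_fn()
    after = thread_names()
    return leaked_threads(before, after)
```

If the list returned for each trial keeps containing the same three kinds of names, the shutdown is not cleaning them up between trials.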


@xwjiang2010 this seems like an ML question.

Yeah, a repro script would be helpful. Happy to take a look then!

Working on getting a repro script together.

We were able to fix the two leaking threads by:

  1. Uninstalling tensorboardX
  2. Setting "log_sys_usage" to False
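For the second step, a minimal sketch of how the flag might be applied per trial, assuming the trainer takes a plain config dict. Only the "log_sys_usage" key comes from this thread; `base_config`, its contents, and `without_sys_usage_logging` are placeholders for illustration.

```python
# Placeholder trainer config; substitute the config your trials actually use.
base_config = {
    "num_workers": 1,
}


def without_sys_usage_logging(config):
    """Return a copy of the config with system-usage logging turned off."""
    patched = dict(config)  # shallow copy so the original stays untouched
    patched["log_sys_usage"] = False
    return patched


trial_config = without_sys_usage_logging(base_config)
```

Copying the dict keeps the shared base config unchanged across optuna trials.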

ray/ was the offending code, at least for #2: it ignores ray.shutdown.

Maybe something with psutil?