Thank you jjyao! I’ll put a repro script together.
I managed to trace this down; it is specific to train.trainer.
The Train trainer spins up a TensorBoardX process, a Ray util thread, and a QueueFeederThread, but calling ray.shutdown() after a training run does not terminate them until the main process exits. We run a trial loop with Optuna where each trial calls ray.init() and ray.shutdown(), yet these three never close on shutdown. Every time the trainer is invoked again, a new set of three is spawned and also never shut down, so they accumulate until the run effectively fork-bombs the machine.
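For reference, our loop looks roughly like this (a minimal sketch while I prepare the full repro; run_training is a hypothetical stand-in for our actual Train trainer call, and the prints are only there to show the leak growing per trial):

```python
import multiprocessing
import threading

import optuna
import ray


def run_training(lr: float) -> float:
    """Hypothetical stand-in: the real code builds and runs the Train trainer here."""
    return lr


def objective(trial: optuna.Trial) -> float:
    # Each Optuna trial gets its own Ray session, as in our setup.
    ray.init()
    try:
        score = run_training(trial.suggest_float("lr", 1e-5, 1e-1, log=True))
    finally:
        ray.shutdown()

    # On our setup these counts keep climbing across trials: the TensorBoardX
    # process, Ray util thread, and QueueFeederThread from the previous
    # trainer are still alive even after ray.shutdown().
    print(
        f"threads still alive: {threading.active_count()}, "
        f"child processes: {len(multiprocessing.active_children())}"
    )
    return score


study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)
```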