Hello, I was working with `tune.run` with a custom objective function and was attempting to run a random search over ~1000 trials. Everything seemed to be working, but eventually some trials/workers would hang on “RUNNING”. On the first run, all 4 concurrent trials (the total resources I was dedicating at the time) locked up at once, and the run never advanced. The second time it reached the end with 3 out of 4 locked, but the one frozen trial never completed.
I looked through the documentation to see if it was possible to set a per-trial timeout, but only found `time_budget_s`, which controls the timeout for the entire run. I tried fiddling with schedulers and trial executor classes, but nothing seemed to give me the ability to directly exit a trial with an error after some timeout period.
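To make the behaviour I'm after concrete, here is a rough sketch in plain Python. It is not a Ray Tune API: `run_with_timeout` and `_call` are hypothetical helper names. The sketch assumes a Unix-like OS (it uses the `fork` start method) and an objective that takes a config dict and returns a picklable value.

```python
import multiprocessing as mp
import queue


def _call(fn, config, results):
    # Runs in the child process: send back either the result or the error text.
    try:
        results.put(("ok", fn(config)))
    except Exception as exc:
        results.put(("error", repr(exc)))


def run_with_timeout(fn, config, timeout_s):
    """Run fn(config) in a child process; raise TimeoutError after timeout_s seconds."""
    ctx = mp.get_context("fork")  # assumption: Unix-like OS where fork is available
    results = ctx.Queue()
    proc = ctx.Process(target=_call, args=(fn, config, results))
    proc.start()
    try:
        # Wait for the child to report back, but no longer than timeout_s.
        status, payload = results.get(timeout=timeout_s)
    except queue.Empty:
        # Nothing arrived in time: kill the hung child and surface an error.
        proc.terminate()
        proc.join()
        raise TimeoutError(f"objective exceeded {timeout_s} s")
    proc.join()
    if status == "error":
        raise RuntimeError(f"objective failed: {payload}")
    return payload
```

The idea would be to call something like this from inside the function passed to `tune.run`, so that a hung objective raises an error instead of sitting on “RUNNING”. I don't know whether spawning a child process inside a Tune worker is a reasonable thing to do.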
I was wondering if there was something I missed or if there is something else I could leverage to gain this functionality. I would appreciate any guidance, thanks.
EDIT: I wanted to add that I don't see any reason my objective would fail to complete. The trials should take ~5 seconds each, but as I explained above, that is not the case for all of them.