I’m trying to run Ray on Cori at NERSC (a supercomputer, not a cluster) using Slurm. Others have done this more or less successfully, but I’m running into stability issues too often to consider it functional, and I’m hoping for ideas/feedback to improve the situation.
I started with the NERSC-provided script, then added features from the ATLAS ones, in particular explicit synchronization instead of statements like “sleep 30” and hoping that is enough to start Ray on all workers (a minimal sketch of that synchronization follows below). I removed running “ray stop” from the slurm scripts (which gave no end of trouble) and instead call “ray.shutdown()” from Python. On the head and worker nodes, I tried different amounts of load/allocation (it’s clear that some resources need to be left for Redis).
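For reference, a minimal sketch of the explicit synchronization, polling until every allocated node has registered instead of sleeping (the RAY_HEAD_ADDRESS variable name and the 5-minute budget are my placeholders, not from the actual scripts):

```python
import os
import sys
import time

import ray

# Connect to the head started by the slurm script; the env var name is an
# assumption, use whatever the batch script actually exports.
ray.init(address=os.environ["RAY_HEAD_ADDRESS"])

expected = int(os.environ["SLURM_JOB_NUM_NODES"])  # standard Slurm variable
deadline = time.time() + 300                       # assumed startup budget

# Poll until every allocated node has registered, instead of "sleep 30".
while time.time() < deadline:
    alive = sum(1 for node in ray.nodes() if node["Alive"])
    if alive >= expected:
        break
    time.sleep(5)
else:
    sys.exit("not all Ray workers came up in time")

# ... run the workload ...

ray.shutdown()  # instead of running "ray stop" from the slurm script
```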
It works, just not all the time, and I suspect that errors will be more frequent with larger jobs (simply more chances for failure).
There are four types of errors that occur with some regularity, listed in decreasing order of severity:
- “bind: Address already in use”. As a result, the job hangs until the allocation time runs out, at which point it gets killed and the logs are never flushed, so there is no further information. (See the port sketch after this list.)
- “_start_new_thread(self._bootstrap, ()); RuntimeError: can’t start new thread”. Happens when handing a new remote task to a multiprocessing.Pool. The exception makes it out to user code, so the application can exit and clean up. Bad, but at least it doesn’t lose the full allocated time. (Re-submitting the task may even work? See the retry sketch after this list.)
- “Resource temporarily unavailable”: happens at the start of processing. Ray fully recovers from this one, but the log file is flooded with messages; despite the error, the application runs successfully to completion. So annoying, but not the end of the world. My best guess is that the socket’s buffer is full and boost’s asio simply tries again (and then succeeds).
- Log files not fully received. Output from Python’s logging module does not always appear in the temporary directory. This does not affect running per se, but it makes debugging a lot harder. I tried flushing stderr/stdout in several places, but nothing gets the log files written consistently in full. (See the flushing sketch after this list.)
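On the “Address already in use” error, my (unverified) guess is leftover Ray processes or another service holding the default port on a reused node. A hedged workaround sketch: ask the OS for a currently free port and hand it to “ray start” explicitly; there is still a small race between closing the probe socket and Ray binding the port:

```python
import socket
import subprocess

def free_port() -> int:
    """Ask the kernel for a port that is free right now (the port-0 trick)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Start the head on an explicitly chosen free port instead of the default.
port = free_port()
subprocess.run(["ray", "start", "--head", f"--port={port}"], check=True)
```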
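Since the “can’t start new thread” RuntimeError reaches user code, one pattern I’m considering (untested; pool, func, and args stand in for whatever the application actually submits) is to retry the submission once and otherwise shut down cleanly instead of burning the rest of the allocation:

```python
import time

import ray

def submit_with_retry(pool, func, args, retries=1):
    """Submit to a multiprocessing.Pool, retrying once on thread exhaustion."""
    for attempt in range(retries + 1):
        try:
            return pool.apply_async(func, args)
        except RuntimeError:          # "can't start new thread"
            if attempt == retries:
                ray.shutdown()        # release the allocation cleanly
                raise
            time.sleep(10)            # assumed backoff before retrying
```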
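And roughly what I tried for flushing (without success so far): flush every handler on the root logger plus the raw streams at the end of each task, then call logging.shutdown() before exiting (handlers attached to non-root loggers would need the same treatment):

```python
import logging
import sys

def flush_all_logs() -> None:
    """Flush root-logger handlers and the raw stdout/stderr streams."""
    for handler in logging.getLogger().handlers:
        handler.flush()
    sys.stdout.flush()
    sys.stderr.flush()

# e.g. at the end of each remote task / before ray.shutdown():
flush_all_logs()
logging.shutdown()  # flushes and closes all registered handlers
```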
Any ideas on the cause of these and/or on how to improve the stability are appreciated!