Hi,
Not quite sure which channel to put this in.
I am using Ray Tune on a Ray cluster running under Slurm, which I set up by following the "Deploying on Slurm" guide in the Ray v2.0.0.dev0 docs.
To requeue the Slurm batch job before it times out, I have Slurm send a user-defined signal 300 seconds before the time limit:
    #SBATCH --signal=USR1@300
which is captured by
    signal.signal(signal.SIGUSR1, self.sig_handler)
which triggers requeueing:
    def sig_handler(self, signum, frame):
        print(f"Caught signal: {signum}")
        job_id = os.environ['SLURM_JOB_ID']
        cmd = 'scontrol requeue {}'.format(job_id)
        print(f'\nRequeueing job {job_id}...')
        result = call(cmd, shell=True)
        if result == 0:
            print(f'Requeued exp {job_id}')
        else:
            print('Requeue failed...')
        # Exit immediately, skipping normal interpreter cleanup
        os._exit(0)
This works fine without Ray. With Ray, however, SIGUSR1 seems to be caught before my handler runs, and I get this error:
2021-09-02 11:58:32,917 ERROR worker.py:475 -- print_logs: Connection closed by server.
2021-09-02 11:58:32,917 ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
2021-09-02 11:58:32,921 ERROR worker.py:1217 -- listen_error_messages_raylet: Connection closed by server.
srun: error: b004: task 0: User defined signal 1
So is there a way to get Ray to ignore SIGUSR1 or SIGUSR2?
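
In case it helps to reproduce this, here is a minimal, self-contained sketch of the handler setup described above that runs without Slurm or Ray. The `RequeueHandler` class name and the `requeue_cmd`, `exit_after`, `caught` and `last_result` names are made up for illustration and are not part of my actual code; the sketch assumes a Unix system where `SIGUSR1` exists.

```python
import os
import signal
import subprocess


class RequeueHandler:
    """Hypothetical helper: runs a requeue command when SIGUSR1 arrives."""

    def __init__(self, requeue_cmd=("scontrol", "requeue"), exit_after=True):
        # requeue_cmd is injectable so the handler can be tried without Slurm.
        self.requeue_cmd = list(requeue_cmd)
        self.exit_after = exit_after
        self.caught = []         # signal numbers seen so far
        self.last_result = None  # exit code of the last requeue command

    def install(self):
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGUSR1, self.sig_handler)

    def sig_handler(self, signum, frame):
        self.caught.append(signum)
        job_id = os.environ.get("SLURM_JOB_ID", "")
        # Argument list instead of shell=True, so job_id is never shell-parsed.
        self.last_result = subprocess.call(self.requeue_cmd + [job_id])
        if self.last_result == 0:
            print(f"Requeued exp {job_id}")
        else:
            print("Requeue failed...")
        if self.exit_after:
            os._exit(0)


if __name__ == "__main__":
    # Dry run: "echo" stands in for "scontrol requeue", and the process
    # keeps running afterwards instead of exiting.
    handler = RequeueHandler(requeue_cmd=("echo", "would requeue"), exit_after=False)
    handler.install()
    os.kill(os.getpid(), signal.SIGUSR1)
```

Sending the signal to the process itself with `os.kill` is only a stand-in for what Slurm does near the time limit. It checks that the handler is registered, but it says nothing about how Ray's own processes react to the same signal.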