Ignore user-defined signal

Hi,

Not quite sure which channel to put this in.

I am using Ray Tune on a Ray cluster with Slurm, and I set up the cluster following the "Deploying on Slurm" guide (Ray v2.0.0.dev0).

I am using a user-defined signal to requeue the Slurm batch job before it times out, using: #SBATCH --signal=USR1@300

which is captured by

signal.signal(signal.SIGUSR1, self.sig_handler)

which triggers requeueing:

import os
from subprocess import call

def sig_handler(self, signum, frame):
    print(f"Caught signal: {signum}")

    # Slurm exports the job ID into the batch job's environment.
    job_id = os.environ['SLURM_JOB_ID']
    cmd = f'scontrol requeue {job_id}'

    print(f'\nRequeueing job {job_id}...')
    result = call(cmd, shell=True)
    if result == 0:
        print(f'Requeued exp {job_id}')
    else:
        print('Requeue failed...')

    # Exit immediately, skipping normal interpreter cleanup.
    os._exit(0)

This works fine without Ray. With Ray, however, SIGUSR1 seems to be captured before my handler runs, and the following errors appear:

2021-09-02 11:58:32,917	ERROR worker.py:475 -- print_logs: Connection closed by server.
2021-09-02 11:58:32,917	ERROR import_thread.py:88 -- ImportThread: Connection closed by server.
2021-09-02 11:58:32,921	ERROR worker.py:1217 -- listen_error_messages_raylet: Connection closed by server.
srun: error: b004: task 0: User defined signal 1

So is there a way to get Ray to ignore SIGUSR1 or SIGUSR2?

I think this is because we register our own signal handler, which is probably invoked before yours.
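
For context on the point above: in Python, a process holds only one handler per signal, and a later `signal.signal` call replaces whatever was registered earlier. The sketch below uses only the standard library to show one way to keep an earlier handler alive by chaining to it. `install_chained_handler` is a hypothetical helper name, not a Ray API, and whether Ray actually installs a SIGUSR1 handler is not confirmed in this thread.

```python
import signal


def install_chained_handler(signum, new_handler):
    """Install new_handler for signum and still call the handler it replaces.

    Hypothetical helper for illustration. Must be called from the main
    thread, because signal.signal() only works there.
    """
    # Whatever is currently registered: a Python callable,
    # signal.SIG_DFL, signal.SIG_IGN, or None.
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        new_handler(sig, frame)
        # Only a Python callable can be invoked; SIG_DFL, SIG_IGN,
        # and None are skipped.
        if callable(previous):
            previous(sig, frame)

    signal.signal(signum, chained)
    return previous
```

Under that assumption, calling `install_chained_handler(signal.SIGUSR1, handler)` after another library has registered its handler would run `handler` first and then the earlier Python-level handler. It only affects the process that calls it.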

Can you create a feature request on our GitHub?

Sure, I’ll recreate the issue on GitHub.