Running on an individual node on a Slurm cluster

I submit jobs with different experiments to the university Slurm cluster. To keep different jobs from clashing with each other, I submit each job to a separate node with 40 cores, claiming the whole node for the RLlib experiments that job runs. I have noticed that on some particular nodes my jobs started crashing soon after submission. First, I get several warnings of this type:

2024-11-14 11:27:41,841 INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.10.100.37:8000...
2024-11-14 11:27:46,864 ERROR gcs_utils.py:213 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2024-11-14 11:27:46,864 WARNING utils.py:1416 -- Unable to connect to GCS at 10.10.100.37:8000. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.

And after that, the job fails with this error:

python3.10/site-packages/ray/_private/utils.py", line 1432, in internal_kv_get_with_retry
    raise ConnectionError(
ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?

I suspect that Ray was not shut down after some of my previous jobs (for instance, if they crashed for some reason), and the leftover cluster is not something I can reach from the next job running on that node.

Is there any way to shut down any Ray clusters running on the current node before I initialize Ray for the new experiment?
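For example, something like this at the top of my sbatch script (just a sketch; the SBATCH options and the script name are placeholders, not my actual submission file):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --cpus-per-task=40

# Force-kill any Ray processes left over from a previous job on this node
ray stop --force

# Then start the new experiment with a clean slate
python run_experiment.py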

I have found a way to fix the problem. It turns out that, for some reason, the /tmp/ray/ray_current_cluster file was not deleted after some of the previous tests, so the new job tries to connect to a Ray cluster that no longer exists. Now I call the function

ray._private.utils.reset_ray_address()

at the beginning of the test to make sure that the file holding the address of the previous cluster is deleted. I am still not sure why this happens only on some particular nodes, but I hope this helps somebody facing a similar problem.
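For reference, this is roughly how the top of my test script looks now (a sketch; the ray.init() call is shown without the actual arguments of my experiment):

import ray
import ray._private.utils

# Delete the stale /tmp/ray/ray_current_cluster file left behind by a
# previous job, so ray.init() starts a fresh local Ray instance instead
# of trying to reach a cluster that no longer exists.
ray._private.utils.reset_ray_address()

ray.init()
# ... run the RLlib experiment as usual ...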