I submit jobs for different experiments on our university Slurm cluster. To keep the jobs from clashing with one another, I submit each one to a separate node with 40 cores, claiming the whole node for a single RLlib experiment. I have noticed that on certain nodes my jobs started crashing soon after submission. First, I get several warnings like these:
2024-11-14 11:27:41,841 INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.10.100.37:8000...
2024-11-14 11:27:46,864 ERROR gcs_utils.py:213 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2024-11-14 11:27:46,864 WARNING utils.py:1416 -- Unable to connect to GCS at 10.10.100.37:8000. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
And after that, the job fails with this error:
python3.10/site-packages/ray/_private/utils.py", line 1432, in internal_kv_get_with_retry
raise ConnectionError(
ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?
I suspect that Ray does not shut down cleanly after some of my previous jobs (for instance, if they crashed), so a stale Ray instance is left behind on the node, but the next job running there has no handle on it.
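To confirm this, I'm thinking of checking for leftover Ray daemons at the start of each job. A minimal sketch, assuming a Linux node with `pgrep` on PATH; `raylet` and `gcs_server` are Ray's per-node core processes (the warning above already points at `gcs_server.out`):

```python
import subprocess

# Look for Ray daemons left over from an earlier job on this node.
leftovers = subprocess.run(
    ["pgrep", "-af", "raylet|gcs_server"],
    capture_output=True,
    text=True,
)
if leftovers.stdout:
    print("Stale Ray processes on this node:")
    print(leftovers.stdout)
```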
Is there any way to shut down any Ray clusters running on the current node before I initialize Ray for the new experiment?
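For example, would something like the following at the top of my script be a reasonable approach? This is just a sketch: as far as I understand, `ray stop --force` kills every Ray process it finds on the node (with SIGKILL), which should be safe here since each job claims the whole node anyway.

```python
import subprocess

import ray

# Kill any Ray processes surviving from a previous job on this node.
# check=False because `ray stop` may return nonzero if nothing is running.
subprocess.run(["ray", "stop", "--force"], check=False)

# Now start a fresh local Ray instance for this experiment.
ray.init()
```

Or is it better to just run `ray stop --force` in the sbatch script right before launching Python?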