I submit jobs for different experiments on our university Slurm cluster. To keep the jobs from clashing with one another, I submit each one to a separate node with 40 cores, claiming the whole node for a single RLlib experiment. I have noticed that on certain nodes my jobs started crashing soon after submission. First, I get several warnings like these:
2024-11-14 11:27:41,841 INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 10.10.100.37:8000...
2024-11-14 11:27:46,864 ERROR gcs_utils.py:213 -- Failed to connect to GCS. Please check `gcs_server.out` for more details.
2024-11-14 11:27:46,864 WARNING utils.py:1416 -- Unable to connect to GCS at 10.10.100.37:8000. Check that (1) Ray GCS with matching version started successfully at the specified address, and (2) there is no firewall setting preventing access.
And after that, the job fails with this error:
python3.10/site-packages/ray/_private/utils.py", line 1432, in internal_kv_get_with_retry
raise ConnectionError(
ConnectionError: Could not read 'session_name' from GCS. Did GCS start successfully?
I suspect that Ray does not shut down cleanly after some of my previous jobs (for instance, if they crashed), so a stale Ray instance is left behind on the node, but the next job running there has no handle on it.
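To confirm this, I'm thinking of checking for leftover Ray daemons at the start of each job. A minimal sketch, assuming a Linux node with `pgrep` on PATH; `raylet` and `gcs_server` are Ray's per-node core processes (the warning above already points at `gcs_server.out`):

```python
import subprocess

# Look for Ray daemons left over from an earlier job on this node.
leftovers = subprocess.run(
    ["pgrep", "-af", "raylet|gcs_server"],
    capture_output=True,
    text=True,
)
if leftovers.stdout:
    print("Stale Ray processes on this node:")
    print(leftovers.stdout)
```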
Is there any way to shut down any Ray clusters running on the current node before I initialize Ray for the new experiment?
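For example, would something like the following at the top of my script be a reasonable approach? This is just a sketch: as far as I understand, `ray stop --force` kills every Ray process it finds on the node (with SIGKILL), which should be safe here since each job claims the whole node anyway.

```python
import subprocess

import ray

# Kill any Ray processes surviving from a previous job on this node.
# check=False because `ray stop` may return nonzero if nothing is running.
subprocess.run(["ray", "stop", "--force"], check=False)

# Now start a fresh local Ray instance for this experiment.
ray.init()
```

Or is it better to just run `ray stop --force` in the sbatch script right before launching Python?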