Initial setup for Ray on an HPC

Hi Y’all,

Sorry ahead of time if I am breaking any forum rules.
I am trying to test out the capabilities of Ray on a bare-metal HPC running RHEL 8~9.

My ultimate goal is to get a Ray cluster set up, with the head node running on one node and a worker node running a vLLM-backed model on another node.

However, before doing that, I am having trouble getting the initial ray start working.

  1. I ran ray start --head and successfully started the Ray head node
  2. I ran ray start --address=HEAD_NODE_IP:6379 --num-cpus=32 --num-gpus=1

However, shortly after printing Local node IP: ip_address, the worker waits for a while (~2 min) and then fails with RuntimeError: Failed to connect to GCS.

Some things to note:

  1. I am currently allocating resources using SLURM, as this is a proof of concept before rolling it out system-wide.
  2. Ray is installed via conda on both nodes.
  3. I am able to ping the head node from the worker node and vice versa.
  4. Running ray status on the head node works.
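For what it's worth, the two ray start invocations are often wrapped in a single SLURM batch script so head and workers come up in one allocation; a rough sketch (node counts, sleep duration, and flags are illustrative assumptions, not taken from this thread):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Pick the first allocated node as the Ray head.
nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
head_node=$(echo "$nodes" | head -n 1)
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the head in the background on its node.
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --port=6379 --block &
sleep 10  # give the GCS time to come up before workers connect

# Start a worker on each remaining node.
for node in $(echo "$nodes" | tail -n +2); do
    srun --nodes=1 --ntasks=1 -w "$node" \
        ray start --address="$head_ip:6379" --block &
done
wait
```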

Hey @JayTea, I think you might be missing a TMPDIR. Try providing --temp-dir and also --redis-password.


On head node, open a terminal and type:

ray start --head --port=6379 --num-cpus=<total-cpus> --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --dashboard-host '' --block

On worker nodes, open a terminal and type:

ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --block

and connect to it via:


See if that works.
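The connect snippet appears to have been dropped from the post above; as a minimal sketch, a driver would typically connect through Ray Client, which listens on port 10001 by default when the head is started with ray start --head (the IP below is a placeholder):

```python
HEAD_NODE_IP = "10.0.0.1"  # placeholder; substitute your head node's IP

# Ray Client serves on port 10001 by default on the head node.
client_address = f"ray://{HEAD_NODE_IP}:10001"
print(client_address)

# Against a live cluster this would be:
#   import ray
#   ray.init(client_address)
#   print(ray.cluster_resources())
```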

Hi @max_ronda,

Thanks for reaching out.

Sadly it did not work.

I got the same RuntimeError: Failed to connect to GCS. error when executing the worker node command, and consequently, when I try to run ray.init, I get ConnectionError: ray client connection timeout.

Also when running ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --block I got the warning

--temp-dir=/data1/gridtmp/ray_temp_dir option will be ignored. --head is a required flag to use --temp-dir. temp_dir is only configurable from a head node. All the worker nodes will use the same temp_dir as a head node.
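In other words, per that warning, --temp-dir belongs only on the head node's command; on workers the flag can simply be dropped (remaining flags copied from the command above):

```shell
ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --block
```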

When I run the head node and worker node in two different SLURM allocations on the same physical node, the commands run correctly, and ray status on the head shows that a worker node was added.

Found out it was actually a firewall issue: port 6379 was not open.
To test, I ran nmap head_node_ip -p 6379 from the worker node and saw that the port was closed.
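The same reachability check can be done from plain Python with the standard library, which is handy on nodes where nmap is unavailable; a small sketch (the host and port in the comment are illustrative):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the worker node, e.g.:
# print(port_open("HEAD_NODE_IP", 6379))
```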


Indeed, many networks may not have their firewall rules (CIDR ranges) updated to allow port 6379.

Alternatively, you can use the more recent Ray job submission client approach, which also lets you submit jobs:

Sample code

from ray.job_submission import JobSubmissionClient
import asyncio

# Ray cluster information
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"

# Submit a Ray job using JobSubmissionClient
client = JobSubmissionClient(ray_address)
job_id = client.submit_job(
    entrypoint="python my_script.py",  # placeholder; replace with your job's entrypoint
    runtime_env={"working_dir": "./"},
    entrypoint_num_cpus=1,
)
print(f"Ray job submitted with job_id: {job_id}")

# Inspect the job
job_status = client.get_job_status(job_id)
job_logs = client.get_job_logs(job_id)
job_info = client.get_job_info(job_id)

# Stream logs until the job finishes
async def tail_logs():
    async for lines in client.tail_job_logs(job_id):
        print(lines, end="")

asyncio.run(tail_logs())