Apologies in advance if I am breaking any forum rules.
I am trying to test the capabilities of Ray on a bare-metal HPC cluster running RHEL 8~9.
The ultimate goal is to set up a Ray cluster with the head node on one machine and a worker node running a vLLM-backed model on another.
However, before getting that far, I am having trouble with the initial ray start.
I ran ray start --head, and the Ray head node started successfully.
I then ran ray start --address=HEAD_NODE_IP:6379 --num-cpus=32 --num-gpus=1 on the worker node.
Shortly afterwards it prints Local node IP: ip_address, and after some time (~2 minutes) it fails with RuntimeError: Failed to connect to GCS.
Some things to note:
I am currently allocating resources with Slurm, since this is a proof of concept before rolling it out more widely across the system.
Ray is installed with conda on both nodes.
I am able to ping the head node from the worker node and vice versa.
I got the same RuntimeError: Failed to connect to GCS. error when executing the worker node command, and as a result, when I try to run ray.init I get ConnectionError: ray client connection timeout
Also, when running ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --block, I got this warning:
--temp-dir=/data1/gridtmp/ray_temp_dir option will be ignored. --head is a required flag to use --temp-dir. temp_dir is only configurable from a head node. All the worker nodes will use the same temp_dir as a head node.
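As the warning says, --temp-dir only takes effect together with --head, and workers inherit the head node's temp_dir. A minimal sketch of how the two command lines could be assembled so that --temp-dir is only ever passed on the head node. The helper name ray_start_args is hypothetical, and the sketch only uses the flags that already appear in this thread plus --port:

```python
from typing import List, Optional


def ray_start_args(
    head: bool,
    head_address: Optional[str] = None,
    port: int = 6379,
    temp_dir: Optional[str] = None,
    num_cpus: Optional[int] = None,
    num_gpus: Optional[int] = None,
) -> List[str]:
    """Build a `ray start` argument list (hypothetical helper, not a Ray API)."""
    args = ["ray", "start"]
    if head:
        args += ["--head", f"--port={port}"]
        # temp_dir is only configurable from the head node.
        if temp_dir is not None:
            args.append(f"--temp-dir={temp_dir}")
    else:
        if head_address is None:
            raise ValueError("a worker node needs the head node address")
        # Workers silently skip temp_dir: they reuse the head node's setting.
        args.append(f"--address={head_address}:{port}")
    if num_cpus is not None:
        args.append(f"--num-cpus={num_cpus}")
    if num_gpus is not None:
        args.append(f"--num-gpus={num_gpus}")
    return args
```

For example, " ".join(ray_start_args(head=False, head_address="HEAD_NODE_IP", num_cpus=32, num_gpus=1)) gives the worker command from the original post, with HEAD_NODE_IP still a placeholder.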
Update
Interestingly, when I run the head node and the worker node in two different Slurm instances on the same physical node, the commands run correctly, and ray status shows that a node was added to the head node.
It turned out to be a firewall issue: port 6379 was not open.
To test, I ran nmap head_node_ip -p 6379 from the worker node and saw that the port was not open.
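This also explains why ping worked: ping only exercises ICMP, so it can succeed while a firewall still blocks the TCP port the GCS listens on. If nmap is not available on the compute nodes, a rough equivalent check can be sketched with the Python standard library. The helper name port_is_open is hypothetical, and HEAD_NODE_IP and 6379 are placeholders for your own head node address and port:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


# Example, run from the worker node (HEAD_NODE_IP is a placeholder):
#   port_is_open("HEAD_NODE_IP", 6379)
```

A False result from the worker node while the head node is running points at a firewall or routing problem rather than at Ray itself.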
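import asyncio

import ray
from ray.job_submission import JobSubmissionClient

# Placeholder: address of the Ray dashboard on the head node (default port 8265).
ray_address = "http://HEAD_NODE_IP:8265"


async def main():
    while True:
        # Submit Ray job using JobSubmissionClient
        client = JobSubmissionClient(ray_address)
        job_id = client.submit_job(
            entrypoint="python run.py",
            runtime_env={"working_dir": "./"},
            entrypoint_num_cpus=1,
        )
        print(client.__dict__)
        print(f"Ray job submitted with job_id: {job_id}")
        # Wait for a while to let the job run
        await asyncio.sleep(10)
        job_status = client.get_job_status(job_id)
        get_job_logs = client.get_job_logs(job_id)
        get_job_info = client.get_job_info(job_id)
        # tail_job_logs is an async generator, so it must be consumed inside a coroutine
        async for lines in client.tail_job_logs(job_id):
            print(lines, end="")
        # Shutdown Ray
        ray.shutdown()


asyncio.run(main())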
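Instead of sleeping for a fixed 10 seconds before reading the job status, one option is to poll until the job reaches a terminal state. A minimal sketch, assuming the status names SUCCEEDED, FAILED, and STOPPED from ray.job_submission.JobStatus; the helper name wait_for_job is hypothetical, and it takes the status getter as an argument so it can be tried without a running cluster:

```python
import time
from typing import Callable

# Terminal job states, assumed to match the names in ray.job_submission.JobStatus.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(
    get_status: Callable[[str], object],
    job_id: str,
    timeout_s: float = 300.0,
    poll_s: float = 1.0,
) -> str:
    """Poll get_status(job_id) until a terminal state is seen or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(job_id)
        # Accept either an enum member (use its .value) or a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s} s")
        time.sleep(poll_s)
```

With the client from the snippet above, this would be called as wait_for_job(client.get_job_status, job_id).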