Initial setup for Ray on an HPC

Hi Y’all,

Sorry ahead of time if I am breaking any forum rules.
I am trying to test out the capabilities of Ray on a bare-metal HPC running RHEL 8/9.

The ultimate goal is to set up a Ray cluster with the head running on one node and a worker running a vLLM-backed model on another node.
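
For context, the launch I am aiming for eventually looks roughly like this once the cluster is up (the model name and flag values are placeholders, and the exact vLLM options depend on the version installed):

# Rough sketch only: with a Ray cluster already running across both nodes,
# start the vLLM OpenAI-compatible server and let it place its workers on
# the Ray cluster (model and parallelism values are placeholders).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --distributed-executor-backend ray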

However, before getting there, I am having trouble with the initial ray start.

  1. I ran ray start --head on the head node and the Ray head started successfully
  2. On the worker node, I ran ray start --address=HEAD_NODE_IP:6379 --num-cpus=32 --num-gpus=1

However, shortly after printing Local node IP: ip_address, the worker sits for a while (~2 min) and then fails with RuntimeError: Failed to connect to GCS.

Some things to note:

  1. I am currently allocating resources with Slurm, since this is a proof of concept before rolling it out more widely (a rough sketch of the Slurm launch is below, after this list).
  2. Ray is installed via conda on both nodes.
  3. I am able to ping the head node from the worker node and vice versa.
  4. Running ray status on the head node works.
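
For reference, the Slurm side of the proof of concept looks roughly like this (node counts, CPU/GPU numbers, and the sleep are placeholders for my setup):

#!/bin/bash
#SBATCH --job-name=ray-poc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:1

# The first node in the allocation becomes the Ray head
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Start the head, give it a moment, then join the other node as a worker
srun --nodes=1 --ntasks=1 -w "$head_node" \
    ray start --head --port=6379 --num-cpus=32 --num-gpus=1 --block &
sleep 10
srun --nodes=1 --ntasks=1 --exclude="$head_node" \
    ray start --address="$head_ip:6379" --num-cpus=32 --num-gpus=1 --block &
wait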

Hey @JayTea, I think you might be missing a TMPDIR. Try providing --temp-dir and also --redis-password

e.g.

On head node, open a terminal and type:

ray start --head --port=6379 --num-cpus=<total-cpus> --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --dashboard-host '0.0.0.0' --block

On worker nodes, open a terminal and type

ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --block

and connect to it via:

import ray

# Connect a driver to the cluster through the Ray Client server (port 10001)
ray.init(
    address=f"ray://{head_node_ip_address}:10001",
    log_to_driver=False,
    ignore_reinit_error=True,
)

See if that works.
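
One more thing worth checking: ray.init with a ray:// address goes through the Ray Client server, which listens on port 10001 by default, so that port also needs to be reachable from wherever the driver runs. A quick check from the driver machine (assuming nc is available):

nc -zv HEAD_NODE_IP_ADDRESS 10001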

Hi @max_ronda,

Thanks for reaching out.

Sadly it did not work.

I got the same RuntimeError: Failed to connect to GCS. error when executing the worker node command, and consequently, when I tried to run ray.init, I got ConnectionError: ray client connection timeout

Also, when running ray start --address=HEAD_NODE_IP_ADDRESS:6379 --num-cpus=32 --redis-password=2173274697 --temp-dir=/data1/gridtmp/ray_temp_dir --block, I got the following warning:

--temp-dir=/data1/gridtmp/ray_temp_dir option will be ignored. --head is a required flag to use --temp-dir. temp_dir is only configurable from a head node. All the worker nodes will use the same temp_dir as a head node.

Update
Interestingly, when I run the head node and worker node in two different Slurm allocations on the same physical node, the commands run correctly and ray status on the head shows that a worker node was added.

Turns out it was actually a firewall issue: port 6379 was not open.
To confirm, I ran nmap head_node_ip -p 6379 from the worker node and saw the port was closed.
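
For anyone else who hits this: on RHEL with firewalld, checking and opening the relevant ports looked roughly like this (assuming firewalld is the active firewall and you have sudo; adjust the list to the ports your cluster actually uses):

# From the worker node, check whether the GCS port is reachable
nc -zv head_node_ip 6379

# On the head node, open the GCS, dashboard, and Ray Client ports
sudo firewall-cmd --permanent --add-port=6379/tcp
sudo firewall-cmd --permanent --add-port=8265/tcp
sudo firewall-cmd --permanent --add-port=10001/tcp
sudo firewall-cmd --reload

Note that Ray also uses a range of additional ports for node-to-node communication, so a broader rule between the cluster nodes may still be needed.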


Indeed, many networks don't have port 6379 open for the relevant CIDR range by default.

Alternatively, you can use the Ray job submission client approach below, which only needs the dashboard port (8265) and also lets you submit jobs:

Sample code

import asyncio
import time

from ray.job_submission import JobSubmissionClient

# Ray cluster information (this address is a KubeRay service name; on bare
# metal, point it at http://HEAD_NODE_IP:8265 instead)
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"

# Submit a Ray job using JobSubmissionClient
client = JobSubmissionClient(ray_address)
job_id = client.submit_job(
    entrypoint="python run.py",
    runtime_env={"working_dir": "./"},
    entrypoint_num_cpus=1,
)
print(f"Ray job submitted with job_id: {job_id}")

# Wait a while to let the job start, then check on it
time.sleep(10)
print(client.get_job_status(job_id))
print(client.get_job_info(job_id))

# Stream the job logs; tail_job_logs returns an async iterator,
# so it has to be consumed from an async function
async def tail_logs():
    async for lines in client.tail_job_logs(job_id):
        print(lines, end="")

asyncio.run(tail_logs())
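
The same thing is also available from the command line, which can be easier from a login node (the address is just the head node's dashboard endpoint):

ray job submit --address http://HEAD_NODE_IP:8265 --working-dir ./ -- python run.py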