1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: ray, version 2.48.0
- Python version: Python 3.10.12
- OS: Ubuntu 22.04
- Cloud/Infrastructure: VastAI cloud
- Other libs/tools (if relevant): vLLM (used together with Ray)
3. What happened vs. what you expected:
- Expected: I am trying to create a head node and a worker node on two different instances hosted on VastAI cloud, using the commands from the documentation.
- Actual: I cannot get it working. The worker joins the cluster, but after some time the GCS server marks the worker as dead because of a health check failure. This happens because of an unusual networking setup, which I explain below.
I created two instances on VastAI using a template that opens 4 ports, say A, B, C, and D. The instances are Docker containers/VMs running on a host, so my instance's internal ports (A, B, C, D) are forwarded to random external ports that are different from A, B, C, D. Each instance also gets a different external port assigned for the same internal port.
If the node manager port for my worker is an internal port, the head node is not able to connect to it during the health check, because that port is internal and is not the one exposed on the public IP.
I am giving each instance's public IP in the configuration, so the two instances can discover each other, but they cannot connect on the port.
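To make the mismatch concrete, here is a rough Python sketch of the situation as I understand it. The IP address, the port numbers, and the names `Instance`, `advertised_address`, and `reachable_address` are all made up for illustration; none of them come from Ray or VastAI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical model of one cloud instance with forwarded ports."""
    public_ip: str
    port_map: dict[int, int]  # internal port -> randomly assigned external port

    def external_port(self, internal_port: int) -> int:
        # Look up the host-side port that forwards to the given internal port.
        return self.port_map[internal_port]


def advertised_address(inst: Instance, internal_port: int) -> tuple[str, int]:
    # What the worker reports: its public IP plus the port it bound inside the instance.
    return (inst.public_ip, internal_port)


def reachable_address(inst: Instance, internal_port: int) -> tuple[str, int]:
    # What another machine actually has to dial: public IP plus the forwarded port.
    return (inst.public_ip, inst.external_port(internal_port))


# Example values only (203.0.113.0/24 is a documentation address range).
worker = Instance(public_ip="203.0.113.20", port_map={6380: 41021})
print("advertised:", advertised_address(worker, 6380))
print("reachable: ", reachable_address(worker, 6380))
```

The two addresses differ only in the port, which is exactly the gap I think the health check falls into: the head dials the advertised pair, while only the reachable pair is open from outside.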
So in summary, I am not able to run a Ray cluster on VMs that are not in the same network, and the port configuration is not consistent between instances.
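For anyone who wants to reproduce the connectivity problem without Ray in the picture, a plain TCP probe like the one below should be enough. This is only a minimal sketch using the Python standard library; `can_connect` is a helper name I made up, and the 3 second timeout is an arbitrary choice.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from the head instance, `can_connect(worker_public_ip, worker_internal_port)` would be expected to return False in my setup, while the same call with the worker's external (forwarded) port should return True as long as something is listening behind it.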
I was wondering if I am missing something. I have not seen any guide on how to get a Ray cluster working on VastAI cloud anywhere, and I have also done a lot of research using GPT-5 and other LLMs to find an answer.
I am using the following two commands to start the head and the worker:
ray start --head \
    --port=$HOST_INTERNAL_PORT1

ray start --address="$HEAD_IP:$HOST_EXTERNAL_PORT1" \
    --node-manager-port=$WORKER_INTERNAL_PORT_2 \
    --node-manager-host=$WORKER_IP
Please help me; this is a blocker for my project.