I have a cluster of three machines (A, B, C). I want to call serve.start() on machine A and pass the IP of machine B as the HTTP host, but it reports that the binding fails. Does Ray Serve support this kind of deployment?
@839576266 you only need to call serve.start on one machine; the replicas of your deployments will be placed across the cluster. The http_host is the host that the HTTP server should bind to on each node (typically localhost or 0.0.0.0).
I tried to set the host in serve.start(), but it doesn’t seem to work. Is there anything else I need to configure?
import time
import ray
from ray import serve

environment_dict = {
    "working_dir": "/home/hwd/kernel"
}
ray.init(address='ray://10.3.70.138:10001', runtime_env=environment_dict)
http_dict = {"host": "10.3.70.140", "port": 9009}
serve.start(http_options=http_dict)

@serve.deployment
def hello(request):
    name = request.query_params["name"]
    return f"Hello {name}!"

# Deploy model.
info = hello.options(num_replicas=3).deploy()
while True:
    time.sleep(5)
But the script raised an exception:
Traceback (most recent call last):
  File "/tmp/pycharm_project_188/tests/hik_serving/test/ray_test.py", line 19, in <module>
    serve.start(http_options=http_dict)
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/serve/api.py", line 465, in start
    ray.get(
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 359, in get
    res = self._get(to_get, op_timeout)
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/util/client/worker.py", line 386, in _get
    raise err
types.RayTaskError(ValueError): ray::HTTPProxyActor.ready() (pid=27111, ip=10.3.70.138, repr=<ray.serve.http_proxy.HTTPProxyActor object at 0x7f8f8119be20>)
OSError: [Errno 99] Cannot assign requested address

During handling of the above exception, another exception occurred:

ray::HTTPProxyActor.ready() (pid=27111, ip=10.3.70.138, repr=<ray.serve.http_proxy.HTTPProxyActor object at 0x7f8f8119be20>)
  File "/usr/local/abm/python/lib/python3.8/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/local/abm/python/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/serve/http_proxy.py", line 329, in ready
    return await done_set.pop()
  File "/usr/lib/fedai/hikflkernel/.venv/lib/python3.8/site-packages/ray/serve/http_proxy.py", line 348, in run
    raise ValueError(
ValueError: Failed to bind Ray Serve HTTP proxy to '10.3.70.140:9009'.
Please make sure your http-host and http-port are specified correctly.
@839576266 are you sure that 10.3.70.140 is the right address to bind to on the cluster? If you try localhost or 0.0.0.0, does it work?
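The traceback hints at why the bind fails: HTTPProxyActor.ready() ran with ip=10.3.70.138, i.e. on the head node, so the proxy asked the head node's operating system to bind 10.3.70.140, an address that belongs to a different machine. The OS refuses that with EADDRNOTAVAIL, which on Linux is exactly "[Errno 99] Cannot assign requested address". A minimal sketch reproducing this with only the standard socket module; try_bind is a hypothetical helper, and 192.0.2.1 (a reserved documentation address from RFC 5737) stands in for "some other machine's IP":

```python
import errno
import socket


def try_bind(host: str, port: int = 0) -> int:
    """Try to bind a TCP socket to (host, port) on *this* machine.

    Returns 0 on success, otherwise the errno of the failed bind().
    Port 0 lets the OS pick any free port, so only the host matters.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


if __name__ == "__main__":
    # Wildcard and loopback addresses exist on every machine, so these succeed.
    # 192.0.2.1 is not assigned to any local interface, so bind() fails with
    # EADDRNOTAVAIL, just like 10.3.70.140 does when seen from 10.3.70.138.
    for host in ("0.0.0.0", "127.0.0.1", "192.0.2.1"):
        rc = try_bind(host)
        print(host, "->", "OK" if rc == 0 else errno.errorcode.get(rc, rc))
```

A process can only bind addresses owned by the machine it runs on, which is why localhost or 0.0.0.0 work where another node's IP cannot.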
I have deployed a Ray cluster where 10.3.70.138 is the head node and 10.3.70.140 is a worker node. I am running this program on 10.3.70.138, and if I use localhost, the Serve deployment ends up on 10.3.70.138. But I want the Serve deployment to run on 10.3.70.140, and I’m not sure whether Ray supports this.
@839576266 the replicas of the deployment will run across the cluster; you don’t need to set the host for each node in the cluster individually.
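To make the split between replicas and the HTTP entry point concrete, here is a toy, pure-Python model (not Ray code). A deployment with num_replicas=3 means three copies of hello that Ray may schedule on any node, while one HTTP proxy forwards each incoming request to one of them. The class ToyProxy, the node labels, and the round-robin policy are hypothetical illustrations; Ray Serve's real router and Ray's real placement decisions follow their own policies.

```python
import itertools
from collections import Counter


class ToyProxy:
    """Toy stand-in for one HTTP entry point fronting several replicas.

    Hypothetical illustration, not Ray code: each replica is a
    (node, replica_id) pair and requests are handed out round-robin.
    """

    def __init__(self, replicas):
        self._next_replica = itertools.cycle(replicas)

    def handle(self, name):
        node, replica_id = next(self._next_replica)
        return node, f"Hello {name}! (replica {replica_id} on {node})"


# num_replicas=3 -> three copies of the same function. Here they happen to
# sit on three different nodes; in a real cluster Ray's scheduler decides
# where they go, and several may share one node.
replicas = [("node-A", 0), ("node-B", 1), ("node-C", 2)]
proxy = ToyProxy(replicas)

# Six requests through the single entry point, counted by serving node.
served_by = Counter(proxy.handle("ray")[0] for _ in range(6))
print(served_by)
```

Only the entry point owns a listening port in this model, which mirrors why the port shows up on the head node alone even when replicas run on other machines.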
What does "replicas" refer to in Ray Serve? I deployed with hello.options(num_replicas=3).deploy(), but only the head node is listening on the port; on the other machines in the cluster I can’t find any related process.
The head node’s HTTP proxy will load-balance requests across all of the hello replicas.
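If the goal is for 10.3.70.140 itself to accept HTTP requests, the Ray 1.x http_options also take a "location" field: the default "HeadOnly" starts a single proxy on the head node (the behaviour observed here), while "EveryNode" starts one proxy per node. A minimal sketch, assuming the same Ray 1.x Serve API as the script in the question; serve_http_options and start_serve are hypothetical helper names, and the address and working_dir are copied from that script. The host is "0.0.0.0" rather than a specific IP so that each node's proxy binds only its own interfaces.

```python
def serve_http_options(port: int = 9009) -> dict:
    """HTTP options for a Serve instance with one proxy on every node.

    Assumes the Ray 1.x http_options keys host/port/location.
    "0.0.0.0" makes each proxy bind all interfaces of the node it runs on,
    so no node ever tries to bind another machine's IP.
    """
    return {"host": "0.0.0.0", "port": port, "location": "EveryNode"}


def start_serve(address: str = "ray://10.3.70.138:10001") -> None:
    # Ray is imported here so serve_http_options() is usable without Ray.
    import ray
    from ray import serve

    ray.init(address=address, runtime_env={"working_dir": "/home/hwd/kernel"})
    # Called once, from one machine; Serve then starts a proxy per node.
    serve.start(http_options=serve_http_options())
```

After start_serve(), the hello deployment can be deployed exactly as before; the driver still has to stay alive, since a non-detached Serve instance shuts down with it. A request to 10.3.70.140:9009 should then be answered by B's local proxy, which forwards it to whichever hello replica it picks.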