Deploying Ray Serve on Kubernetes

karlo_st · February 11, 2021, 12:39pm

Hi,

I have a Ray cluster running on Kubernetes. Starting a client works, but I hit a strange issue when creating a backend (the example from the documentation, Key Concepts — Ray v1.1.0):

>>> client.create_backend("simple_backend_class", RequestHandler, "hello, world!")


Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 31, in check
    return f(self, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/api.py", line 295, in create_backend
    replica_config))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1379, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RayServeException): ray::ServeController.create_backend() (pid=75, ip=10.244.1.242)
  File "python/ray/_raylet.pyx", line 463, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 412, in ray._raylet.execute_task.function_executor
  File "python/ray/_raylet.pyx", line 1501, in ray._raylet.CoreWorker.run_async_func_in_event_loop
  File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/home/ray/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py", line 836, in create_backend
    raise e
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py", line 833, in create_backend
    backend_config.num_replicas)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/serve/controller.py", line 282, in _scale_backend_replicas
    num_possible, current_num_replicas + num_possible))
ray.serve.exceptions.RayServeException: Cannot scale backend simple_backend_class to 1 replicas. Ray Serve tried to add 1 replicas but the resources only allows 0 to be added. To fix this, consider scaling to replica to 0 or add more resources to the cluster. You can check avaiable resources with ray.nodes().

I connected to Ray cluster like this:

import os

import ray
from ray import serve

if __name__ == "__main__":
    # The head node address is expected in this environment variable.
    if not os.environ.get("RAY_HEAD_SERVICE_HOST"):
        raise ValueError("RAY_HEAD_SERVICE_HOST environment variable empty. "
                         "Is there a ray cluster running?")
    redis_host = os.environ["RAY_HEAD_SERVICE_HOST"]
    ray.init(address=redis_host + ":6379")
    # backend_config = serve.BackendConfig(num_replicas=1)
    # client = serve.start(detached=True, http_host="0.0.0.0")
    client = serve.connect()

I saw this post, but didn’t find any more information about what the error could mean:

Kind regards,
Karlo

sangcho · February 11, 2021, 6:49pm

cc @Dmitri @simon-mo

simon-mo · February 16, 2021, 9:42pm

ray.init should be run so that it connects to the local raylet address instead of the Redis address. Where is the script running: the head pod, a worker pod, or a job?

karlo_st · February 17, 2021, 8:13am

It’s running on the head pod; redis_host is actually the head node address.
Anyway, in the meantime I moved to Ray 1.2.0, and everything works like a charm for now,
although there were firewall problems at first (figuring out which ports to open).
In the end, these commands were used to connect to the Ray cluster manually with Ray 1.2.0 and Python 3.6.8 (the first on the head node, the second on each worker node):

ray start --head --port=6379 --redis-shard-ports=6380,6381 --object-manager-port=2384 --gcs-server-port=45451

ray start --address='<head_node_ip>:6379' --redis-password='5241590000000000' --object-manager-port=2384
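
For the firewall question, a small sketch like the following can check from one node whether the ports used in the commands above are reachable on another. It uses only Python's standard socket module. The names RAY_PORTS, port_is_open and check_ports are hypothetical, and the port list only covers what appears in these commands, so a real cluster may need further ports opened.

```python
import socket

# Ports copied from the `ray start` commands above; adjust to your own setup.
RAY_PORTS = {
    "redis": 6379,
    "redis shard 1": 6380,
    "redis shard 2": 6381,
    "object manager": 2384,
    "gcs server": 45451,
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports=RAY_PORTS):
    """Map each port label to whether that port is reachable on `host`."""
    return {label: port_is_open(host, port) for label, port in ports.items()}
```

Running check_ports("<head_node_ip>") from a worker node, with the real head node address filled in, shows which of these ports the firewall still blocks.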

Thanks.

konichuvak · March 2, 2021, 6:42pm

Just encountered the same error when connecting to the local raylet (i.e. not passing any address explicitly). The instance only has 2 cores; would that be a problem?

karlo_st · March 3, 2021, 8:10am

Hi, did you try increasing the --num-cpus argument (or maybe also the number of replicas in the backend), like this:
ray start --head --num-cpus=6 --port=6379

I had the same problem on a machine with 2 CPUs, but on a machine with 8 CPUs it always worked with the default settings, because num_cpus defaults to the number of CPUs on the machine.
To confirm this, I created a backend (client.create_backend()) on the 8-CPU machine and then checked the cluster resources (ray.cluster_resources() and ray.available_resources()); it always used 3 CPUs.
It seems as if num_cpus only caps the number of replicas, because I tried setting it to 500 and that also worked.
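
To make the arithmetic behind the error concrete, here is a minimal sketch (not Ray Serve's actual code) of how many replicas fit into a set of free resources. The function name replicas_that_fit is hypothetical, and the figures are assumptions: each replica is taken to need 1 CPU, and the other 2 of the 3 CPUs observed above are treated as fixed overhead. In practice the `available` dict would come from ray.available_resources().

```python
def replicas_that_fit(available, per_replica):
    """Count how many replicas fit into `available` cluster-wide resources.

    Simplification: Ray places actors per node, so the real number can be lower.
    """
    # Ignore zero-sized demands; they never limit the count.
    demands = {name: amount for name, amount in per_replica.items() if amount > 0}
    if not demands:
        raise ValueError("per_replica must request at least one resource")
    # The scarcest resource decides how many replicas fit.
    return min(int(available.get(name, 0) // amount)
               for name, amount in demands.items())


# Assumed numbers: 1 CPU per replica, 2 CPUs of overhead (3 observed minus 1 replica).
per_replica = {"CPU": 1}
print(replicas_that_fit({"CPU": 2 - 2}, per_replica))  # 2-CPU machine -> 0
print(replicas_that_fit({"CPU": 8 - 2}, per_replica))  # 8-CPU machine -> 6
print(replicas_that_fit({"CPU": 6 - 2}, per_replica))  # num_cpus set to 6 -> 4
```

Under these assumptions a 2-CPU machine leaves room for 0 replicas, which matches the error message. Ray treats CPU counts as logical resources for scheduling, so a larger num_cpus makes room on paper without adding physical cores.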
