[High] [Ray Serve] Run gRPC services in one cluster

I have several problems.

  1. Is it possible to run multiple gRPC services in the same cluster, to which I plan to connect 2-3 remote servers (nodes)?
    And then call each service at the same address (host:port), just with a different endpoint?

For example,
head = 172.198.0.2:9000
I started 2 services in a cluster connected to the head, with different endpoints but on the same port. Now I would like to be able to make requests to any of them at addresses like these:
172.198.0.2:9000/cluster_gpu/models1
172.198.0.2:9000/cluster_gpu/models2
172.198.0.2:9000/cluster_cpu/models3 (for example, from another cluster launched without a GPU)

I tried to start one service on a remote machine, but I don't understand why the service ends up running on the head node; from the documentation I got the impression that the head node acts just as a router.

  2. Moreover, I ran into the problem that the service also starts on the remote machine at the same time.
    It endlessly restarts the replicas on that host (the one I actually intended to use), while starting successfully on the head node.

Now I’m running it locally via docker-compose, ray[serve]==2.1.0

My Dockerfile for the head node (e.g. 172.198.0.2):

FROM rayproject/ray:2.1.0-gpu
...
ENV RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0

ENTRYPOINT [\
    "ray", "start",\
    "--head",\
    "--port=6379",\
    "--redis-shard-ports=6380,6381",\
    "--object-manager-port=22345",\
    "--node-manager-port=22346",\
    "--dashboard-host=0.0.0.0",\
    "--ray-client-server-port=10001",\
    "--block"]

Dockerfile for the serve node (e.g. 172.198.0.3, with RAY_HEAD_ADDRESS=172.198.0.2:6379):

FROM rayproject/ray:2.1.0-gpu
...
ENV RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=0
ENV RAY_HEAD_ADDRESS=${RAY_HEAD_ADDRESS}

CMD ray start --address=$RAY_HEAD_ADDRESS && \
    serve run \
        serve_grpc:my_deployment \
        --runtime-env=runtime.yaml

serve_grpc.py

....
@serve.deployment(
    is_driver_deployment=True,
    name="adapter",
    ray_actor_options={"num_gpus": 1},
    num_replicas=1,
    router_prefix="/gpu/models1"
)
class ModelsServices(models_pb2_grpc.ModelsServiceServicer, gRPCIngress):
   ....

my_deployment = ModelsServices.bind()

Please tell me, how do I do this correctly?

Hi @psydok, thank you for trying it out!

  1. router_prefix is not allowed in the gRPC use case (an attribute check will be added; there is a ticket tracking this). Each node will have a dedicated endpoint on port 9000.
  2. You can add a model_name field to your schema; based on the request, you can route the traffic internally to different downstream models, as in the snippet below (see also the fuller sketch after it):
if request.model_name == "Model_A":
    self.model_A_handle.remote(xxx)
if request.model_name == "Model_B":
    self.model_B_handle.remote(xxx)
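
For illustration, here is a rough end-to-end sketch of that routing pattern. Predict, model_name, payload, ModelA and ModelB are made-up names, not your actual proto schema; the models_pb2 / models_pb2_grpc / gRPCIngress imports are assumed to be the same as in your serve_grpc.py, and the await-the-handle pattern follows the Ray 2.x model composition docs:

from ray import serve

# models_pb2 / models_pb2_grpc / gRPCIngress are assumed to be imported as in serve_grpc.py.

@serve.deployment(is_driver_deployment=True, name="adapter", num_replicas=1)
class ModelsServices(models_pb2_grpc.ModelsServiceServicer, gRPCIngress):
    def __init__(self, model_a_handle, model_b_handle):
        # Handles to the two downstream model deployments, passed in via .bind() below.
        self.model_A_handle = model_a_handle
        self.model_B_handle = model_b_handle

    async def Predict(self, request, context):
        # Route on a request field instead of a URL prefix.
        if request.model_name == "Model_A":
            ref = await self.model_A_handle.remote(request.payload)
        else:
            ref = await self.model_B_handle.remote(request.payload)
        return models_pb2.PredictResponse(result=await ref)

# ModelA and ModelB are ordinary @serve.deployment classes (illustrative).
my_deployment = ModelsServices.bind(ModelA.bind(), ModelB.bind())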

Hope this works for your use case.

Thanks for answering!
Can you suggest how I can get model_A_handle?
I decided to separate the clusters and run different ones for the service and for the models.

First container for models:
Dockerfile

FROM rayproject/ray:latest-gpu
...
CMD ray start --head && python model_A.py

model_A.py

import time

import ray


@ray.remote(num_gpus=1)
class ModelA:
    ...

if __name__ == "__main__":
    a = ModelA.options(
        name="general_models", lifetime="detached", namespace="models"
    ).remote()
    # otherwise the container exits
    while True:
        time.sleep(100)

Second container in a separate project:
Dockerfile

FROM rayproject/ray:latest-gpu
...
CMD  ray start --head && python serve_grpc.py

serve_grpc.py

@serve.deployment(
    is_driver_deployment=True,
    name="adapter",
    ray_actor_options={"num_gpus": 1},
    num_replicas=1,
)
class ModelsServices(models_pb2_grpc.ModelsServiceServicer, gRPCIngress):
    def __init__(self):
        self.cli1 = ray.init(
            address="ray://first_container_address:10001",
            allow_multiple=True,
            namespace="models",
        )
        with self.cli1:
            self.model_a = ray.get_actor("general_models")

if __name__ == "__main__":
    a_deployment = ModelsServices.bind()
    serve.run(a_deployment)

But I can't get the actor.
The service returns an error when I try to get the actor (model_a) from the other cluster:

ValueError: Failed to look up actor with name 'general_models'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
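
For reference, ray.get_actor also accepts an explicit namespace argument; a minimal sketch of the lookup (the address is a placeholder):

import ray

# Connect to the cluster that owns the actor (the address is a placeholder).
ray.init(address="ray://first_container_address:10001", namespace="models")

# Both the name and the namespace must match what the actor was created with,
# i.e. ModelA.options(name="general_models", namespace="models", lifetime="detached").
model_a = ray.get_actor("general_models", namespace="models")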

I moved the actor onto the same node as the head and it seems to work. The only confusing thing is that I have to write while True so that the container doesn't exit. Maybe there is another option?
(Dockerfile → CMD ray start --head --dashboard-host=0.0.0.0 --port=6379 && python general.py)
But now, when the service (the second container, with gRPC) tries to process a request through the actor handle, I get this error in the container:

service-general-1 | (ServeReplica:adapter pid=303) Error in data channel:
service-general-1 | (ServeReplica:adapter pid=303) 0
service-general-1 | (ServeReplica:adapter pid=303) Queue filler thread failed to join before timeout: 10
service-general-1 | (ServeReplica:adapter pid=303) 2022-12-05 12:06:49,743 ERROR dataclient.py:323 -- Unrecoverable error in data channel. component=serve deployment=adapter replica=adapter#PPsNhc

The client receives the following error when waiting for a response from the service:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNKNOWN
details = "Unexpected <class ‘ConnectionError’>: Failed during this or a previous request. Exception that broke the connection: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.FAILED_PRECONDITION
details = “No module named ‘models_pb2’”
debug_error_string = “UNKNOWN:Error received from peer ipv4:10.80.0.21:10001 {grpc_message:“No module named 'models_pb2'”, grpc_status:9, created_time:“2022-12-05T12:06:49.743138168-08:00”}”

Hi @psydok,

  1. For passing a model handle, I think you can read the Experimental Direct Ingress — Ray 2.1.0 example. You can directly pass a RayServeDeploymentHandle by doing the following (a fuller sketch follows this snippet):
    orange_stand = OrangeStand.bind()
    apple_stand = AppleStand.bind()
    fruit_market = FruitMarket.bind(orange_stand, apple_stand)
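
For context, a rough fleshed-out version of that example (the class bodies here are made up for illustration; the handles arrive in FruitMarket's constructor and are awaited following the Ray 2.x model composition docs):

from ray import serve

@serve.deployment
class OrangeStand:
    def __call__(self, num: int) -> float:
        return 2.0 * num

@serve.deployment
class AppleStand:
    def __call__(self, num: int) -> float:
        return 3.0 * num

@serve.deployment
class FruitMarket:
    def __init__(self, orange_stand, apple_stand):
        # These arrive as handles to the two downstream deployments.
        self.orange_stand = orange_stand
        self.apple_stand = apple_stand

    async def check_price(self, fruit: str, num: int) -> float:
        if fruit == "ORANGE":
            ref = await self.orange_stand.remote(num)
        else:
            ref = await self.apple_stand.remote(num)
        return await ref

orange_stand = OrangeStand.bind()
apple_stand = AppleStand.bind()
fruit_market = FruitMarket.bind(orange_stand, apple_stand)
serve.run(fruit_market)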

Let me know if that works for you.

  2. And yes, if you don't want your script to exit, you can sleep forever (the simplest way right now, for testing purposes); a rough keep-alive sketch is shown below. Alternatively, you can try ray/api.py at master · ray-project/ray · GitHub to make Serve run detached (private API, not recommended for use in prod).
    If you are on k8s, you can also try RayService - KubeRay Docs.
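
A minimal keep-alive sketch for the script-driven container, assuming the deployment is created in the same script (ModelsServices is reused from the snippets above):

import signal

from ray import serve

serve.run(ModelsServices.bind())

# Block the main process so the container's CMD does not exit;
# SIGINT/SIGTERM still stops it.
signal.pause()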

Thank you. It seems to work. But for some reason I lost the ability to autoscale. When running locally, I was able to get 20 requests per second under a load of 50 requests per second (with a batch size of 5). Now I get only 5 requests per second under the same conditions.
Could you suggest what the problem might be?
I tried setting the autoscaling config explicitly (although it wasn't required before), but it didn't help.

@serve.deployment(
    is_driver_deployment=True,
    name="general",
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 5,
        "target_num_ongoing_requests_per_replica": 10,
    }
)

Could the problem be that the model worker was started on the head node?

I have another problem. Some minor changes radically change the service's behavior. I changed the service Dockerfile to CMD serve run main:service_deployment, and now there are no logs in the console saying that the service was deployed. There is only the dashboard message, and the dashboard page does not load. At the same time, the service is available and processes requests;
it's just that there are no logs in the console, except for these:

service-general-1 | (ServeReplica:service pid=614) not request_id
service-general-1 | (ServeReplica:service pid=614) 2022-12-08 00:51:39,858 ERROR dataclient.py:323 -- Unrecoverable error in data channel. component=serve deployment=service replica=service#DLDuOR
service-general-1 | 2022-12-08 01:23:22,355 WARNING services.py:1933 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.22gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
service-general-1 | 2022-12-08 01:23:22,475 INFO worker.py:1525 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
service-general-1 | (ServeReplica:service pid=614) (raylet) [2022-12-07 11:51:29,257 E 116 157] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-06_23-52-46_926553_6 is over 95% full, available space: 8328564736; capacity: 1913657327616. Object creation will fail if spilling is required.
service-general-1 | (ServeReplica:service pid=614) (raylet) [2022-12-07 11:51:29,257 E 116 157] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-06_23-52-46_926553_6 is over 95% full, available space: 8328564736; capacity: 1913657327616. Object creation will fail if spilling is required.
service-general-1 | (ServeReplica:service pid=614) (raylet) [2022-12-07 11:51:29,257 E 116 157] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2022-12-06_23-52-46_926553_6 is over 95% full, available space: 8328564736; capacity: 1913657327616. Object creation will fail if spilling is required.