Calling serve._run hangs

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello,

I am running a Ray Serve deployment that, in turn, needs to create more deployments. I also don’t want to wait for those deployments to be created before carrying on. I currently do this from the initial deployment by running:

worker_application = ModelDeployment.bind(
    **distributed_model_deployment_args.model_dump()
)
worker_deployment = serve._run(
    worker_application,
    _blocking=False,
    name=f"{self.replica_context.app_name}-{worker_world_rank}",
    route_prefix=f"/{self.replica_context.app_name}-{worker_world_rank}",
)

This works great IF the initial deployment happens to be on the head node of my cluster. If it does not, this command hangs indefinitely. Within the serve._run code, it hangs on:

ray.get(self._controller.deploy_application.remote(name, deployment_args_list))

So it seems nodes separate from the head node (the one with the ServeController Actor) can’t call remote methods on the ServeController. I see this within the python-core-worker.log file on the worker node trying to create the Deployment:

[2024-08-12 22:13:49,273 W 1789381 1789470] task_manager.cc:1103: Task attempt 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 failed with error ACTOR_UNAVAILABLE Fail immediately? 0, status RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: , error info actor_unavailable_error {
  actor_id: "w\251\267\322\021\267\363\005\330>o\220\001\000\000\000"
}
error_message: "The actor is temporarily unavailable: RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: "
error_type: ACTOR_UNAVAILABLE

[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1000: task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 retries left: infinite, oom retries left: -1, task failed due to oom: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1004: Attempting to resubmit task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 for attempt number: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] core_worker.cc:440: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve._private.controller, class_name=ServeController, function_name=deploy_application, function_hash=}, task_id=14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000, task_name=ServeController.deploy_application, job_id=01000000, num_args=4, num_returns=1, max_retries=-1, depth=3, attempt_number=1, actor_task_spec={actor_id=77a9b7d211b7f305d83e6f9001000000, actor_caller_id=ffffffffffffffff91ef5d032155255b545c86ae01000000, actor_counter=2, retry_exceptions=0}

Is there something I’m doing wrong here, or is there an RPC timeout I can increase to get this to work?

Thanks

This shouldn’t be the case. These calls should work from any node in the cluster.

However, what you’re doing isn’t really a recommended pattern. Can you explain more about why you are using serve._run instead of binding another deployment with .bind() through the public API?

Do you mean serve.run vs. serve._run? It is my understanding that while you can set blocking=False in serve.run, the underlying call to serve._run will still block until the __init__ method of the new application has finished.

What I am doing is calling torch.distributed.init_process_group at the end of the __init__ method of the initial deployment, as well as of the ones that it creates. Every application has to be created and reach the init_process_group call before any of them can finish its __init__, so the initial deployment can’t wait for the others to finish first. Let me know if this makes sense or if you need more information.
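
To make the ordering problem concrete, here is a toy sketch that uses only the Python standard library, with a threading.Barrier standing in for init_process_group. It is not Ray or PyTorch code: start_ranks, rank_init and the timeout are hypothetical names chosen for illustration, and a single launcher loop is a simplification of my setup, where the first deployment launches the others itself.

```python
import threading


def start_ranks(world_size: int, blocking: bool, timeout_s: float = 0.5) -> bool:
    """Launch `world_size` fake ranks; return True if all pass the rendezvous."""
    # The barrier plays the role of init_process_group: nobody gets
    # through until every rank has arrived.
    barrier = threading.Barrier(world_size)
    passed = []
    lock = threading.Lock()

    def rank_init(rank: int) -> None:
        # Stand-in for an __init__ that ends with the rendezvous call.
        try:
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            return  # rendezvous failed (timed out or already broken)
        with lock:
            passed.append(rank)

    threads = []
    for rank in range(world_size):
        thread = threading.Thread(target=rank_init, args=(rank,))
        thread.start()
        if blocking:
            # Blocking launch: wait for this rank's init to finish before
            # starting the next one, so it can never meet its peers.
            thread.join()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return len(passed) == world_size


if __name__ == "__main__":
    print("non-blocking launch succeeded:", start_ranks(3, blocking=False))
    print("blocking launch succeeded:", start_ranks(3, blocking=True))
```

With a blocking launch the first rank waits alone at the rendezvous until it times out, which is the situation I am trying to avoid by passing _blocking=False.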

Also, is there an environment variable I can set somewhere to increase the time the Actor has to respond to see if that is indeed the issue?
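
Since the log shows "recvmsg:Connection timed out", one thing I can check independently of Ray is whether the worker node can open a TCP connection to the head node at all. Below is a minimal standard-library sketch for that. The function name can_connect is hypothetical, and the host and port in the usage example are placeholders that would need to be replaced with the head node address and a port the ServeController worker process is actually listening on.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute the real head node address and port.
    print(can_connect("127.0.0.1", 6379))
```

If this returns False from the worker node but True from the head node itself, the hang is more likely a firewall or port-range issue between nodes than a timeout that is too short. As far as I know, the range of worker ports can be pinned with the --min-worker-port and --max-worker-port options of ray start, which would make it easier to open the right ports.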