- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello,
I am running a Ray Serve deployment that, in turn, needs to create more deployments. I also don’t want to wait for those deployments to be created before carrying on. I currently do this from the initial deployment by running:
worker_application = ModelDeployment.bind(
    **distributed_model_deployment_args.model_dump()
)
worker_deployment = serve._run(
    worker_application,
    _blocking=False,
    name=f"{self.replica_context.app_name}-{worker_world_rank}",
    route_prefix=f"/{self.replica_context.app_name}-{worker_world_rank}",
)
This works great IF the initial deployment happens to be on the head node of my cluster. If it does not, this command hangs indefinitely. Within the serve._run code, it hangs on:
ray.get(self._controller.deploy_application.remote(name, deployment_args_list))
So it seems nodes separate from the head node (the one with the ServeController Actor) can’t call remote methods on the ServeController. I see this within the python-core-worker.log file on the worker node trying to create the Deployment:
[2024-08-12 22:13:49,273 W 1789381 1789470] task_manager.cc:1103: Task attempt 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 failed with error ACTOR_UNAVAILABLE Fail immediately? 0, status RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: , error info actor_unavailable_error {
  actor_id: "w\251\267\322\021\267\363\005\330>o\220\001\000\000\000"
}
error_message: "The actor is temporarily unavailable: RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: "
error_type: ACTOR_UNAVAILABLE
[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1000: task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 retries left: infinite, oom retries left: -1, task failed due to oom: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1004: Attempting to resubmit task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 for attempt number: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] core_worker.cc:440: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve._private.controller, class_name=ServeController, function_name=deploy_application, function_hash=}, task_id=14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000, task_name=ServeController.deploy_application, job_id=01000000, num_args=4, num_returns=1, max_retries=-1, depth=3, attempt_number=1, actor_task_spec={actor_id=77a9b7d211b7f305d83e6f9001000000, actor_caller_id=ffffffffffffffff91ef5d032155255b545c86ae01000000, actor_counter=2, retry_exceptions=0}
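Since the log reports `recvmsg:Connection timed out`, one check that might help narrow this down is whether the worker node can open a plain TCP connection to the head node at all. Below is a minimal sketch using only the Python standard library. The helper name `can_connect` is made up for this example, and the host and port in the usage comment are placeholders, not values from my cluster; the actual address and port of the ServeController's worker process would have to be looked up separately.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout_s."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Hypothetical usage from the worker node (placeholder host and port):
# print(can_connect("10.0.0.1", 10002))
```

If this returns False from the worker node but True from the head node itself, that would point towards a firewall or routing issue between the nodes rather than something in Serve.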
Is there something I’m doing wrong here, or is there an RPC timeout I can adjust to get this to work?
Thanks