Calling serve._run hangs

JadenFK · August 13, 2024, 3:44am

Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello,

I am running a Ray Deployment that, in turn, needs to create more deployments. I also don’t want to wait for those deployments to finish being created before carrying on. I currently do this from the initial deployment by running:

worker_application = ModelDeployment.bind(
    **distributed_model_deployment_args.model_dump()
)
worker_deployment = serve._run(
    worker_application,
    _blocking=False,
    name=f"{self.replica_context.app_name}-{worker_world_rank}",
    route_prefix=f"/{self.replica_context.app_name}-{worker_world_rank}",
)

This works great IF the initial deployment happens to be on the head node of my cluster. If it does not, this command hangs indefinitely. Within the serve._run code, it hangs on:

ray.get(self._controller.deploy_application.remote(name, deployment_args_list))

So it seems nodes separate from the head node (the one with the ServeController Actor) can’t call remote methods on the ServeController. I see this within the python-core-worker.log file on the worker node trying to create the Deployment:

[2024-08-12 22:13:49,273 W 1789381 1789470] task_manager.cc:1103: Task attempt 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 failed with error ACTOR_UNAVAILABLE Fail immediately? 0, status RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: , error info actor_unavailable_error {
  actor_id: "w\251\267\322\021\267\363\005\330>o\220\001\000\000\000"
}
error_message: "The actor is temporarily unavailable: RpcError: RPC Error message: recvmsg:Connection timed out; RPC Error details: "
error_type: ACTOR_UNAVAILABLE

[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1000: task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 retries left: infinite, oom retries left: -1, task failed due to oom: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] task_manager.cc:1004: Attempting to resubmit task 14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000 for attempt number: 0
[2024-08-12 22:13:49,273 I 1789381 1789470] core_worker.cc:440: Will resubmit task after a 0ms delay: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.serve._private.controller, class_name=ServeController, function_name=deploy_application, function_hash=}, task_id=14b5a0f6020ac5bc77a9b7d211b7f305d83e6f9001000000, task_name=ServeController.deploy_application, job_id=01000000, num_args=4, num_returns=1, max_retries=-1, depth=3, attempt_number=1, actor_task_spec={actor_id=77a9b7d211b7f305d83e6f9001000000, actor_caller_id=ffffffffffffffff91ef5d032155255b545c86ae01000000, actor_counter=2, retry_exceptions=0}
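
The recvmsg:Connection timed out message in the log is at least consistent with the worker node being unable to open a connection to the process that hosts the ServeController, although the log alone does not prove that. One way to narrow it down is a plain TCP reachability check run from the worker node. The sketch below uses only the Python standard library; the helper name can_connect and the example host and port are hypothetical placeholders, and the real address of the controller's worker process would have to be looked up on the cluster.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within timeout_s."""
    try:
        # create_connection raises OSError (including timeouts and refusals)
        # when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


# Hypothetical usage from the worker node (address and port are placeholders):
# print(can_connect("10.0.0.5", 10001))
```

A False result from the worker node combined with True from the head node would suggest a network or firewall problem rather than anything specific to Serve.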

Is there something I’m doing wrong here, or is there an RPC timeout I can adjust to get this to work?

Thanks

eoakes · August 13, 2024, 4:24pm

This shouldn’t be the case; these calls should work from anywhere on the cluster.

However, what you’re doing isn’t really a recommended pattern. Can you explain more about why you are using serve._run instead of binding another deployment with .bind() through the public API?

JadenFK · August 13, 2024, 9:00pm

Do you mean serve.run vs serve._run? It is my understanding that while you can set blocking=False in serve.run, the underlying call to serve._run will block until the __init__ method of the new application is finished.

What I am doing is calling torch.distributed.init_process_group at the end of the __init__ method for the initial deployment, as well as for the ones that it creates. This requires every application to be created and to reach the init_process_group call before any of them can finish its __init__, so the initial deployment can’t wait for the others to finish. Let me know if this makes sense or if you need more information.
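
The ordering problem described here can be illustrated without Ray or torch. The sketch below is only a simulation under stated assumptions: threading.Barrier stands in for the rendezvous inside init_process_group, threads stand in for deployments, and run_group, init_member, and wait_for_workers are made-up names. It shows that if the coordinator waits for the workers to finish initializing before reaching its own rendezvous, nobody ever gets through, whereas launching them without waiting lets everyone meet.

```python
import threading


def run_group(world_size: int, wait_for_workers: bool, timeout: float = 0.5) -> bool:
    """Simulate a coordinator plus workers whose init ends in a rendezvous.

    Returns True if every member passed the rendezvous, False otherwise.
    """
    barrier = threading.Barrier(world_size)  # stand-in for init_process_group
    passed = [False] * world_size

    def init_member(rank: int) -> None:
        try:
            # Blocks until all world_size members arrive, or the timeout breaks it.
            barrier.wait(timeout=timeout)
            passed[rank] = True
        except threading.BrokenBarrierError:
            passed[rank] = False

    workers = [
        threading.Thread(target=init_member, args=(rank,))
        for rank in range(1, world_size)
    ]
    for w in workers:
        w.start()

    if wait_for_workers:
        # "Blocking deploy": the coordinator waits for worker init to finish
        # before reaching its own rendezvous, so the workers can never complete.
        for w in workers:
            w.join()

    init_member(0)  # the coordinator's own rendezvous
    for w in workers:
        w.join()
    return all(passed)


if __name__ == "__main__":
    print(run_group(3, wait_for_workers=False))
    print(run_group(3, wait_for_workers=True))
```

The timeout exists only so the simulation terminates; without it, the blocking variant would wait forever.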

Also, is there an environment variable I can set to give the Actor more time to respond, so I can check whether that is indeed the issue?
