Actors cannot be scheduled after Serve Redeployment with Runtime Envs

Hi there,

my question touches a lot of different topics: Kubernetes, runtime envs, custom resources, and Serve backends, so I am not sure which one is the problematic part here…

I have the following issue:

I create some Serve deployments by connecting to a Ray cluster with the Ray client and using runtime environments. When called, a deployment creates a detached actor, which then needs to be accessed by other deployments.

You can imagine something like this:

import uuid

import ray
import requests
from ray import serve
from starlette.responses import JSONResponse


# Toy stand-in for our real detached actor; it claims a sliver of the custom
# 'specialResource', so it can only be scheduled on nodes that provide it.
@ray.remote(num_cpus=0, resources={'specialResource': 0.01})
class CaseActor:
    def __init__(self):
        pass

    def some_func(self):
        print('HI')

@serve.deployment
class TestBackend:
    """ Actor for proving functionality form static_holo_verification library

    """

    def __init__(self):
        pass

    async def __call__(self, request) -> JSONResponse:
        """ Just wait"""
        # Create a uniquely named detached actor; it outlives this request.
        actor = CaseActor.options(name=str(uuid.uuid4()), lifetime="detached").remote()
        print('Created the actor')
        await actor.some_func.remote()
        print('Called the actor function')
        return JSONResponse('Called the actor function')



if __name__ == '__main__':
    # Start with ray start --head --resources='{"specialResource": 5}'
    ray.init('ray://localhost:10001',
             runtime_env={'working_dir': '.', 'excludes': ['.git', 'venv', 'tests']},
             namespace='test')

    client = serve.start(detached=True,
                         http_options={'host': '0.0.0.0', 'port': 8000})

    backends = [(TestBackend,{
            'num_replicas': 1,
            "max_concurrent_queries": 1,
            'ray_actor_options': {'num_cpus': 1, 'runtime_env': {'pip': ['pytest', 'ray[serve]']}}
        })]

    for func, parameters in backends:
        func.options(**parameters).deploy()

    print(requests.get(url='http://localhost:8000/TestBackend'))
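
In the real setup, the actor's name is then handed to other deployments, which look the detached actor up by it. Schematically it looks like this (DownstreamBackend and the fixed name are made up for illustration):

@serve.deployment
class DownstreamBackend:
    async def __call__(self, request) -> JSONResponse:
        # In the real code the name arrives with the request; a placeholder
        # name is used here just for illustration.
        actor = ray.get_actor('some-case-actor-name')
        await actor.some_func.remote()
        return JSONResponse('Looked up and called the detached actor')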

At a later time I redeploy the deployment with either changed code or a different runtime env setup, e.g.:

import ray
import requests
from ray import serve

from ray_initial import TestBackend

if __name__ == '__main__':
    ray.init('ray://localhost:10001',
             runtime_env={'working_dir': '.', 'excludes': ['.git', 'venv', 'tests'], 'pip': ['Werkzeug', 'ray[serve]']},
             namespace='test')

    client = serve.start(detached=True,
                         http_options={'host': '0.0.0.0', 'port': 8000})

    backends = [(TestBackend,{
            'num_replicas': 1,
            "max_concurrent_queries": 1,
            'ray_actor_options': {'num_cpus': 1, 'runtime_env': {'pip': ['zipp', 'ray[serve]']}}
        })]

    for func, parameters in backends:
        func.options(**parameters).deploy()

    print(requests.get(url='http://localhost:8000/TestBackend'))
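
Side note: since the CaseActors are detached, the ones from the first run presumably stay alive across the redeployment and keep holding their 0.01 of specialResource. A minimal sketch of how one could list and clean them up, using ray.util.list_named_actors and assuming everything lives in the same 'test' namespace:

import ray

ray.init('ray://localhost:10001', namespace='test')

# Detached actors survive driver exit and redeployments, so they must be
# cleaned up explicitly. Note that list_named_actors() can also return actors
# we do not own, so filter before killing anything.
for name in ray.util.list_named_actors():
    if 'SERVE' in name:  # crude filter; in real code, match only the uuid-named CaseActors
        continue
    ray.kill(ray.get_actor(name))  # also frees the actor's 'specialResource' slice
    print(f'Killed detached actor {name}')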

The problem is that the second time around, the cluster is no longer able to create the actor. It gets stuck, with ray status showing something like this:

Node status
---------------------------------------------------------------
Healthy:
 1 rayApiWorkerType
 1 rayCpuWorkerType
 1 rayHeadType
 1 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------

Usage:
 0.03000000000000025/5.0 specialResource
 0.07099999999999973/5.0 ApiResources
...
A lot of groups similar to these, but none of them includes specialResource:
 0.0/0.01 ApiResources_group_0_1b2f7452904eb8e8413e638a1a4ff2a6
 0.0/0.01 ApiResources_group_0_1c3046f61092c841043613a515457f75
...
Demands:
 {'specialResource': 0.01}: 2+ pending tasks/actors
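
One way to double-check that the capacity really is there is to query the resource totals directly from a driver, e.g. with this minimal sketch using the standard ray.available_resources() and ray.cluster_resources() APIs:

import ray

ray.init('ray://localhost:10001', namespace='test')

# cluster_resources() reports total capacity, available_resources() what is
# currently unclaimed; the difference is held by live actors/tasks.
total = ray.cluster_resources().get('specialResource', 0.0)
free = ray.available_resources().get('specialResource', 0.0)
print(f'specialResource: {free} of {total} available')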

Do you have any ideas why the actor can no longer be scheduled after a backend redeployment, although there are available resources?

In the logs I see this error:

./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,552 D 12740 12774] service_based_accessor.cc:185: Getting actor info, name = fc38a85a-8a40-41ad-bb56-45830dd6b39b
./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,553 D 12740 12768] core_worker.cc:2074: Failed to look up actor with name: fc38a85a-8a40-41ad-bb56-45830dd6b39b
./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,553 D 12740 12768] service_based_accessor.cc:197: Finished getting actor info, status = NotFound: Actor with name 'fc38a85a-8a40-41ad-bb56-45830dd6b39b' was not found., name = fc38a85a-8a40-41ad-bb56-45830dd6b39b

The print for 'Created the actor' is shown, but the next print is never displayed.

Hi @TanjaBayer

Thanks for the detailed report. I ran this locally using the code you posted (not using Kubernetes), at commit 6aa8a4eddcd6ea6be858bf607e5c9b5007c0dfe0, but couldn't reproduce the error (I got two [200] responses). The ray status output also looked clean, with nothing under Demands:.

I did see status = NotFound in the logs as well, but I think it might be a red herring, because I saw it for the ServeController actor too. It might just be due to reason 3 in the message that appears below that log line: "The actor hasn't been created because named actor creation is asynchronous."
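
For reference: since named actor creation is asynchronous, a lookup that can race with creation is usually wrapped in a small retry loop. A generic sketch, not specific to this repro:

import time

import ray

def get_actor_with_retry(name, timeout_s=10.0, interval_s=0.5):
    """Poll ray.get_actor until the named actor is registered or we time out."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return ray.get_actor(name)
        except ValueError:  # raised while the actor is not (yet) registered
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval_s)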

As for "The print for 'Created the actor' is shown, but the next print is never displayed": I think that may be a red herring as well. I observed it when running the first script, but after adding a time.sleep(2) after the second print, I saw the second print too. I also see "HI" printed in two different log files for two different CaseActors, which means both were started successfully.

I wonder if this means I need to use Kubernetes to reproduce this. Let me check in with @simon-mo about this offline.

Yeah, when testing the above script locally it also works for me :frowning_face: It was more to explain approximately what our use case looks like.

I am trying to find a way to reproduce it with some example. The backends we are deploying are much bigger; not sure if that has any effect, though…
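
One direction I am trying, purely to make the toy example heavier (the names and package lists below are made up): deploy several copies of the backend, each with its own runtime env, and then redeploy them all the same way:

# Sketch: deploy several independent copies of TestBackend, each with its own
# pip env, to get closer to the size of our real deployments.
for i, extra_pkg in enumerate(['pytest', 'zipp', 'Werkzeug']):
    TestBackend.options(
        name=f'TestBackend{i}',  # hypothetical names, one route per copy
        num_replicas=1,
        max_concurrent_queries=1,
        ray_actor_options={'num_cpus': 1,
                           'runtime_env': {'pip': [extra_pkg, 'ray[serve]']}},
    ).deploy()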