Hi there,
my question touches on several topics: Kubernetes, runtime envs, custom resources and Serve backends. I am not sure which one is the problematic part here…
I have the following issue:
I create some Serve deployments using Ray Client, connecting to a Ray cluster and using runtime environments. When a deployment is called, it creates a detached actor, which needs to be accessed by other deployments.
You can imagine something like this:
import uuid

import ray
import requests
from ray import serve
from starlette.responses import JSONResponse


@ray.remote(num_cpus=0, resources={'specialResource': 0.01})
class CaseActor:
    def __init__(self):
        pass

    def some_func(self):
        print('HI')


@serve.deployment
class TestBackend:
    """ Actor for proving functionality from static_holo_verification library
    """

    def __init__(self):
        pass

    async def __call__(self, request) -> JSONResponse:
        """ Just wait"""
        # Each request creates a new detached actor with a unique name.
        actor = CaseActor.options(name=str(uuid.uuid4()), lifetime="detached").remote()
        print('Created the actor')
        await actor.some_func.remote()
        print('Called the actor function')
        return JSONResponse('Called the actor function')


if __name__ == '__main__':
    # Start with ray start --head --resources='{"specialResource": 5}'
    ray.init('ray://localhost:10001',
             runtime_env={'working_dir': '.', 'excludes': ['.git', 'venv', 'tests']},
             namespace='test')
    client = serve.start(detached=True,
                         http_options={'host': '0.0.0.0', 'port': '8000'})
    backends = [(TestBackend, {
        'num_replicas': 1,
        "max_concurrent_queries": 1,
        'ray_actor_options': {'num_cpus': 1, 'runtime_env': {'pip': ['pytest', 'ray[serve]']}}
    })]
    for func, parameters in backends:
        func.options(**parameters).deploy()
    print(requests.get(url='http://localhost:8000/TestBackend'))
Later, I redeploy the deployment with either changed code or a different runtime env setup, e.g.:
import ray
import requests
from ray import serve

from ray_initial import TestBackend

if __name__ == '__main__':
    ray.init('ray://localhost:10001',
             runtime_env={'working_dir': '.', 'excludes': ['.git', 'venv', 'tests'], 'pip': ['Werkzeug', 'ray[serve]']},
             namespace='test')
    client = serve.start(detached=True,
                         http_options={'host': '0.0.0.0', 'port': '8000'})
    backends = [(TestBackend, {
        'num_replicas': 1,
        "max_concurrent_queries": 1,
        'ray_actor_options': {'num_cpus': 1, 'runtime_env': {'pip': ['zipp', 'ray[serve]']}}
    })]
    for func, parameters in backends:
        func.options(**parameters).deploy()
    print(requests.get(url='http://localhost:8000/TestBackend'))
The problem is that after the second deployment, the cluster can no longer create the actor. It gets stuck, with ray status showing something like this:
Node status
---------------------------------------------------------------
Healthy:
 1 rayApiWorkerType
 1 rayCpuWorkerType
 1 rayHeadType
 1 rayWorkerType
Pending:
 (no pending nodes)
Recent failures:
 (no failures)
Resources
---------------------------------------------------------------
Usage:
 0.03000000000000025/5.0 specialResource
 0.07099999999999973/5.0 ApiResources
 ...
 A lot of groups similar to this, but none of them includes specialResource
 0.0/0.01 ApiResources_group_0_1b2f7452904eb8e8413e638a1a4ff2a6
 0.0/0.01 ApiResources_group_0_1c3046f61092c841043613a515457f75
 ...
Demands:
 {'specialResource': 0.01}: 2+ pending tasks/actors
Do you have any idea why the actor cannot be scheduled after a backend redeployment, even though resources are available?
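To make the "resources are available" claim concrete, here is a minimal sketch that parses the Usage lines from the ray status output above and checks whether the pending demand would fit. The helper names parse_usage and can_fit are made up for this illustration and are not part of Ray. The check only looks at the cluster-wide totals that ray status prints, so it says nothing about how the free capacity is spread across individual nodes.

```python
import re

# Usage lines copied from the `ray status` output above.
STATUS_USAGE = """\
0.03000000000000025/5.0 specialResource
0.07099999999999973/5.0 ApiResources
0.0/0.01 ApiResources_group_0_1b2f7452904eb8e8413e638a1a4ff2a6
"""

# Matches lines of the form "<used>/<total> <resource name>".
USAGE_LINE = re.compile(r"^\s*([0-9.]+)/([0-9.]+)\s+(\S+)\s*$")


def parse_usage(text):
    """Return {resource_name: (used, total)} for each usage line (hypothetical helper)."""
    usage = {}
    for line in text.splitlines():
        match = USAGE_LINE.match(line)
        if match:
            used, total, name = match.groups()
            usage[name] = (float(used), float(total))
    return usage


def can_fit(usage, demand):
    """True if each demanded resource has enough free capacity in the aggregate."""
    for name, amount in demand.items():
        used, total = usage.get(name, (0.0, 0.0))
        if total - used < amount:
            return False
    return True


usage = parse_usage(STATUS_USAGE)
print(can_fit(usage, {"specialResource": 0.01}))  # True
```

With the numbers above, roughly 4.97 of 5.0 specialResource is free, so the 0.01 demand fits in the aggregate and a plain shortage of specialResource does not explain the pending actor.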
In the logs I see this error:
./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,552 D 12740 12774] service_based_accessor.cc:185: Getting actor info, name = fc38a85a-8a40-41ad-bb56-45830dd6b39b
./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,553 D 12740 12768] core_worker.cc:2074: Failed to look up actor with name: fc38a85a-8a40-41ad-bb56-45830dd6b39b
./python-core-worker-81e9b08fa6542d4ff110c3c097d9f8862ce4ec71cfc9908aab89e491_12740.log:[2021-09-07 09:44:11,553 D 12740 12768] service_based_accessor.cc:197: Finished getting actor info, status = NotFound: Actor with name 'fc38a85a-8a40-41ad-bb56-45830dd6b39b' was not found., name = fc38a85a-8a40-41ad-bb56-45830dd6b39b
The print for 'Created the actor' is shown, but the next print is never displayed.
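While debugging, it may help to make the hang surface as an explicit timeout instead of a request that never returns. The following is only a sketch of that idea using plain asyncio: call_with_timeout, never_scheduled and quick_call are hypothetical names, and never_scheduled is a stand-in for `actor.some_func.remote()` on an actor that stays pending. It assumes the object returned by `.remote()` can be awaited like any other awaitable, as the `await` in `__call__` above suggests.

```python
import asyncio


async def call_with_timeout(awaitable, timeout_s):
    """Await `awaitable`, giving up after `timeout_s` seconds (hypothetical helper).

    Returns (True, result) on success and (False, None) on timeout.
    """
    try:
        return True, await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        return False, None


async def never_scheduled():
    # Stand-in for a call on an actor that is never scheduled: waits forever.
    await asyncio.Event().wait()


async def quick_call():
    # Stand-in for a call that completes normally.
    return "HI"


print(asyncio.run(call_with_timeout(never_scheduled(), 0.1)))  # (False, None)
print(asyncio.run(call_with_timeout(quick_call(), 0.1)))       # (True, 'HI')
```

Inside the deployment, the same pattern would wrap the awaited actor call, so the handler could log the actor name and return an error response when the timeout fires.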