ModuleNotFoundError during deployment after upgrade to 1.9.0

Hi, I’m trying to upgrade my Ray cluster to the latest release, 1.9.0, but I hit the error below during Serve deployment.

  1. The @serve class:
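from ray import serve

@serve.deployment(num_replicas=1)
@serve.ingress(app=serve_app)  # serve_app is defined elsewhere (not shown in the post)
class PricingApi(object):
    ...  # class body omitted in the original post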
  2. The deploy script:
import ray

ray.init(address='localhost:6379',
         _redis_password='5241590000000000',
         log_to_driver=False,
         namespace='serve')
PricingApi.deploy()
  3. I ran the deploy script above from the project root and got these errors:

2021-12-07 03:48:37,383	INFO worker.py:842 -- Connecting to existing Ray cluster at address: 10.1.1.14:6379
2021-12-07 03:48:37,652	INFO api.py:242 -- Updating deployment 'PricingApi'. component=serve deployment=PricingApi
{'object_store_memory': 1491309772.0, 'memory': 2982619547.0, 'CPU': 2.0, 'node:10.1.1.14': 1.0}
Traceback (most recent call last):
  File "deploy_ray_serve.py", line 10, in <module>
    deploy_cluster()
  File "deploy_ray_serve.py", line 7, in deploy_cluster
    PricingApi.deploy()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/ray/serve/api.py", line 789, in deploy
    return _get_global_client().deploy(
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/ray/serve/api.py", line 93, in check
    return f(self, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/ray/serve/api.py", line 248, in deploy
    self._wait_for_goal(goal_id)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/ray/serve/api.py", line 184, in _wait_for_goal
    raise async_goal_exception
RuntimeError: Deployment 'PricingApi' failed, deleting it asynchronously.
  4. Here’s the related entry from Ray’s log:

2021-12-07 03:48:39,670	ERROR worker.py:431 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::SERVE_REPLICA::PricingApi#gxItCf:RayServeWrappedReplica.__init__ (pid=3668, ip=10.1.1.14)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/ray/serve/replica.py", line 48, in __init__
    deployment_def = cloudpickle.loads(serialized_deployment_def)
ModuleNotFoundError: No module named 'bct'
:actor_name:PricingApi

Any thoughts?
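The replica dies inside cloudpickle.loads, which is consistent with how pickling treats a class that lives in an importable module: the payload stores only the module path and class name, not the class body, so whichever process unpickles it must be able to import bct itself. (cloudpickle serializes a class by value only when it is defined in __main__ or is otherwise not importable.) The sketch below uses the standard-library pickle as a stand-in for cloudpickle and registers a hypothetical in-memory bct.pricing module; it illustrates the mechanism and is not Ray’s actual code path.

```python
import pickle
import sys
import types

# Hypothetical stand-in for the user's bct/pricing.py, registered in memory
# so the example needs no files on disk.
bct = types.ModuleType("bct")
pricing = types.ModuleType("bct.pricing")
exec("class PricingModel:\n    pass\n", pricing.__dict__)
bct.pricing = pricing
sys.modules["bct"] = bct
sys.modules["bct.pricing"] = pricing

# "Driver" side: bct is importable, so pickling succeeds.
# The payload holds the strings 'bct.pricing' and 'PricingModel', not the code.
payload = pickle.dumps(pricing.PricingModel)

# "Replica" side: same bytes, but bct can no longer be imported.
del sys.modules["bct.pricing"], sys.modules["bct"]
try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    error = str(exc)

print(error)  # No module named 'bct'
```

The error text matches the one in the replica log, which is why the question becomes where the replica looks for bct rather than what was serialized.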

It turns out I have to run ‘serve start’ from my project root. There was no such restriction in the past; is this expected?

Hi @blshao84, thanks for trying out our latest release. I believe serve.start() / serve start has been part of our end-to-end tutorial since the Ray 1.3 documentation: End-to-End Tutorial — Ray v2.0.0.dev0. It’s required to initialize the Serve APIs. Maybe it previously worked because you had run serve.start(detached=True) or the CLI command serve start, so Serve kept running in the background, and your upgrade required a restart, so it now needs to be executed again?

Hi @jiaodong, I hit the ‘ModuleNotFoundError’ above in my regression test, which launches Ray and Serve as follows:

ray start --head
serve start
python deploy.py

In deploy.py:

import ray

ray.init(address='localhost:6379', namespace='serve')
PricingApi.deploy()

Before 1.9.0, I ran ‘serve start’ and ‘python deploy.py’ from different directories and it worked fine. With 1.9.0, I have to run both from my project root. I’m totally OK with that (and I think it’s the better way to do it), but I’m curious whether a specific change in this release forced it.

That’s interesting. serve start is simply a wrapper around serve.start() from the Serve API, so it shouldn’t matter where you execute it. How did you install and import the bct module in your code, and do you see the same symptom without it?

‘bct’ is not a third-party library but my own source code. Here’s my project structure:

project_root
      bct
      examples
      tests
      deploy.py
      ...

It only works if I run both ‘serve start’ and ‘python deploy.py’ from ‘project_root’.
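Why would the launch directory matter? One plausible explanation, not confirmed in this thread, is that a process can only import bct if project_root ends up on its sys.path. A process launched from project_root often gets that for free: Python puts the script’s directory (or, for python -c, the current directory) on sys.path. The sketch below imitates this with plain subprocesses standing in for the driver and a replica; the helper names (make_project, run_python, can_unpickle_from) are hypothetical and this is not how Ray actually launches workers.

```python
import os
import subprocess
import sys
import tempfile


def make_project() -> str:
    """Create a hypothetical project_root/bct/pricing.py and return project_root."""
    root = tempfile.mkdtemp()
    pkg = os.path.join(root, "bct")
    os.makedirs(pkg)
    open(os.path.join(pkg, "__init__.py"), "w").close()
    with open(os.path.join(pkg, "pricing.py"), "w") as f:
        f.write("class PricingModel:\n    pass\n")
    return root


def run_python(code: str, cwd: str, stdin: bytes = b"") -> subprocess.CompletedProcess:
    # "python -c" puts the current directory on sys.path; PYTHONPATH and
    # PYTHONSAFEPATH are dropped so the launch directory is the only way to find bct.
    env = {k: v for k, v in os.environ.items() if k not in ("PYTHONPATH", "PYTHONSAFEPATH")}
    return subprocess.run([sys.executable, "-c", code], cwd=cwd, env=env,
                          input=stdin, capture_output=True)


def can_unpickle_from(cwd: str, payload: bytes) -> bool:
    """Play a replica process started in cwd and try to load the pickled class."""
    loader = "import pickle, sys; pickle.loads(sys.stdin.buffer.read())"
    return run_python(loader, cwd, payload).returncode == 0


project_root = make_project()

# "Driver" launched from project_root pickles the class (by reference).
dumper = ("import pickle, sys; from bct.pricing import PricingModel; "
          "sys.stdout.buffer.write(pickle.dumps(PricingModel))")
result = run_python(dumper, project_root)
result.check_returncode()
payload = result.stdout

print(can_unpickle_from(project_root, payload))        # True
print(can_unpickle_from(tempfile.mkdtemp(), payload))  # False
```

The same bytes load in one process and fail in the other; the only difference is which directory each process can import from.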

Hi @jiaodong, I have a reproduction here: GitHub - blshao84/ray-deploy-repro: reproduce ray's serve deployment issue

Thanks @blshao84 for the context; this is very comprehensive. I created [Bug] ModuleNotFoundError during deployment after upgrade to 1.9.0 · Issue #21095 · ray-project/ray · GitHub to keep track of the issue and will look into it soon.

@blshao84 Thanks for your patience! We’re still looking into the issue. We have a company holiday until Jan 3, so we’ll be able to take a closer look afterwards. Happy holidays!

Any update on this? I just found another situation in which this issue occurs: GitHub - blshao84/ray-deploy-repro at tests_dir

Hi @blshao84, thank you for the reproduction. Ray cannot automatically ship your local code to the cluster or locate your Python modules without some hints. Please take a look at Handling Dependencies — Ray v1.9.2 and let me know whether it helps!
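Whatever mechanism you pick (the Handling Dependencies guide covers Ray’s runtime_env options), the underlying requirement is that every process that unpickles PricingApi can import bct. The standard-library check below, built around a hypothetical helper named locate, reports the file a module would be imported from, or None if it cannot be found. Running a check like this in the environment where the replica starts is one way to confirm the diagnosis. The sys.path.insert line only stands in for whatever puts project_root on the import path (PYTHONPATH, for example); it is not a Ray API.

```python
import importlib
import importlib.util
import os
import sys
import tempfile


def locate(module_name: str):
    """Return the file module_name would be imported from, or None if it is not importable."""
    importlib.invalidate_caches()  # pick up directories created after interpreter startup
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Hypothetical project_root containing an empty bct package.
project_root = tempfile.mkdtemp()
os.makedirs(os.path.join(project_root, "bct"))
open(os.path.join(project_root, "bct", "__init__.py"), "w").close()

before = locate("bct")  # None: project_root is not on sys.path yet

# Stand-in for PYTHONPATH or any other mechanism that adds project_root to the path.
sys.path.insert(0, project_root)
after = locate("bct")  # <project_root>/bct/__init__.py

print(before, after)
```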