- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
We have the following situation running Ray using KubeRay:
- Local machine has up-to-date version of code
- Using Ray Client, we submit a function for parallel execution on Ray (the function is imported into the module where the Ray calls are made)
- The Ray Head has not been updated recently, so it does not have that function present. Our Head defers all computation to the Workers, which are created from the container image containing the newest code, so execution always works on the Workers.
- However, despite deferring all computation to Workers, a `ModuleNotFoundError` is thrown.
Our expectation was that the execution would go like this:
- The local machine `cloudpickle`s the import path to the function we want to run, generating a BLOB
- Ray stores a mapping between some UID and the BLOB in the GCS
- The UID and relevant args are sent to Ray Worker pods
- The Ray Worker pod fetches the BLOB using the UID, then deserializes it locally, resolving the import path to the actual code present on that pod.
However, it seems that the Ray Head is also deserializing the BLOB at some point, which fails the job because the Ray Head doesn’t have the new module on it yet.
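To illustrate what we think is going on (a rough sketch using cloudpickle directly, not Ray's actual internals): a function defined in an importable module is pickled by reference to its import path, so any process that later deserializes the BLOB must be able to import that module itself, which the Head currently cannot.
import cloudpickle
from some_module import some_fn_parallel    # new code, importable on the local machine

blob = cloudpickle.dumps(some_fn_parallel)  # stores a reference like "some_module.some_fn_parallel"
fn = cloudpickle.loads(blob)                # re-imports some_module; raises ModuleNotFoundError
                                            # on any host where some_module is missing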
Any ideas why this happens, from an architectural perspective? And is there a way for us to avoid any deserialization occurring on the Ray Head?
cc: @timothyngo
Example of what the relevant code would look like:
# run.py
import ray
from some_module import some_fn_parallel

# Connect to the cluster via Ray Client (address is a placeholder)
ray.init("ray://<head-address>:10001")
some_fn_parallel(range(10))  # arbitrary example arguments

# some_module.py
import ray

@ray.remote
def some_fn(x):
    pass

def some_fn_parallel(args):
    ray.get([some_fn.remote(x) for x in args])