Ray Deserializes Function on Head In Addition to Worker

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We have the following situation running Ray with KubeRay:

  1. The local machine has an up-to-date version of the code.
  2. Using Ray Client, it submits a function for parallel execution on Ray (the function is imported into the module where the Ray calls are made).
  3. The Ray Head has not been updated recently, so the function is not present there. Our Head defers all computation to Workers, which are created from a container image containing the newest code, so execution always succeeds on the Workers.
  4. However, despite all computation being deferred to Workers, a ModuleNotFoundError is thrown.

Our expectation was the execution would go like this:

  1. Local machine cloudpickles the import path to the function we want to run, generating a BLOB
  2. Ray stores a mapping between some UID and the BLOB in the GCS
  3. The UID and relevant args are sent to Ray Worker pods
  4. The Ray Worker pod fetches the BLOB using the UID and deserializes it locally, resolving the import path to the actual code on the Worker pod (see the sketch just after this list).

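To illustrate the assumption in steps 1 and 4, here is a minimal sketch using plain cloudpickle outside of Ray. It reuses the placeholder some_module from the example further down and assumes that module is importable on the machine running the sketch:

# sketch.py -- cloudpickle serializes an importable function by reference
import cloudpickle
from some_module import some_fn_parallel

# A function importable at a qualified name is pickled as its import path
# (module + name), not as code.
blob = cloudpickle.dumps(some_fn_parallel)

# Loading the blob on a machine that cannot import some_module raises
# ModuleNotFoundError -- which matches what we see on the Ray Head.
restored = cloudpickle.loads(blob)
assert restored is some_fn_parallel  # resolves to the same function locally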
However, it seems that the Ray Head is also deserializing the BLOB at some point, which fails the job because the Ray Head doesn’t have the new module on it yet.

Any ideas why this is from an architectural perspective, and is there a way for us to avoid any type of deserialization occurring on the Ray Head?

cc: @timothyngo

Example of what the relevant code would look like:

# run.py
import ray
from some_module import some_fn_parallel

ray.init("ray://<head-node-address>:10001")  # connect to Ray via Ray Client
some_fn_parallel()


# some_module.py
import ray

@ray.remote
def some_fn(x):
    pass

def some_fn_parallel(args=range(4)):
    ray.get([some_fn.remote(x) for x in args])

The way Ray Client works is that when you use the Ray API, the calls are proxied through a server running on the head node, so I assume the deserialization happens there.

Is there a reason the head node has to have a different environment? It is not a well-supported path (and we run zero tests for this scenario). If you'd like to set up the environment at runtime, I recommend using Environment Dependencies — Ray 3.0.0.dev0 instead.
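For reference, a rough sketch of what that would look like when connecting with Ray Client (the address and the extra pip dependency are placeholders, not values from your setup):

# run.py -- declare the code as a runtime environment instead of baking it
# into the head image; "working_dir" uploads the local project so the
# processes on the cluster that deserialize the function can import some_module.
import ray

ray.init(
    "ray://<head-node-address>:10001",
    runtime_env={
        "working_dir": ".",               # directory containing some_module.py
        "pip": ["some-private-package"],  # placeholder for any extra deps
    },
)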

@timothyngo As @sangcho pointed out, if you want environment dependencies, you can apply them per task/actor/job or cluster-wide.
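For the per-task/per-actor form, a minimal sketch (the pip package is a placeholder, and some_fn is the task from the example above):

# Per-task runtime environment, declared on the decorator...
import ray
from some_module import some_fn

@ray.remote(runtime_env={"pip": ["some-private-package"]})  # placeholder dep
def pinned_fn(x):
    return x

# ...or attached at call time to an existing task without changing it:
ref = some_fn.options(runtime_env={"pip": ["some-private-package"]}).remote(1)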

Let us know if that works for you.

There isn’t a particular reason that the head node must have a different environment, but our code deploys happen frequently, and we don’t currently know of a clear story for doing rolling upgrades of a k8s-based Ray cluster (i.e., upgrading w/o losing jobs).

We can get around it with runtime environments; we were just surprised it was necessary, given how we thought the serialization worked.