Ray tasks mixing functions from different files

I got a bizarre stack trace back from one of my Ray tasks that runs functions and classes out of a local Python file we have called “djlearn.py”.

If you look at the last two lines of the stack trace below you can see functions coming from two different versions of the same file:

/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py
/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py

One of these is our production version of the file, the other one is a development version. We do run both kinds of jobs on this same Ray cluster, but we certainly don’t intermingle them in the same ray exec <script> runs.

I am guessing that the cluster has cached the functions from these files at different times, and then somehow mixed them together during this latest script run?

This seems like a Ray bug to me, or maybe we are breaking some rules about how to use Ray? Like perhaps “don’t mix file versions on the same Ray cluster” or something?

Can someone please advise, thank you!

Traceback (most recent call last):
  File "mnesModel4.py", line 771, in <module>
    models.run()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 182, in run
    self.processGroups()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 405, in processGroups
    groupData = ray.get(ready)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::rayProcessGroupTrain() (pid=102573, ip=10.0.1.34)
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 725, in rayProcessGroupTrain
    return processGroupTrain(parameters, sourceFrame, groupData)
  File "/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py", line 758, in processGroupTrain
    log.info(f"Training group {groupData['groupIndex']}: {groupData['testGroups'][0]}, randomize {parameters.randomizeTestFeatures}/{parameters.randomizeAllFeatures}/{parameters.addGoalFeature}")
AttributeError: 'Namespace' object has no attribute 'randomizeAllFeatures'
Shared connection to 10.0.1.34 closed.

I was able to reproduce this stack trace consistently.

I then did a ray down followed by a ray up to restart the cluster and it did indeed cure the issue.

So there does appear to be some kind of incorrect caching/mixing of Python objects going on here.

I think it's related to how Ray pushes functions to workers. If two functions have exactly the same signature/module/path, they are considered the same function within the Ray cluster.

For your case, one workaround is simply to not run these on the same cluster. Alternatively, make sure the two versions do not share the same module name. @sangcho could you also confirm what I said here?
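
To illustrate what I mean (just a sketch with hypothetical paths, not Ray internals): two copies of a module that share the same module name and function name are indistinguishable to anything keyed only on those values, even though their code differs:

# Sketch only: load two different files under the same module name, the way
# "import library" would resolve them from whichever directory the job was
# launched in. Paths here are hypothetical.
import importlib.util
import sys

def load(path, module_name="library"):
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

prod = load("/production/library.py")
dev = load("/development/library.py")

# Both report the same module and qualified name, so a cache keyed only on
# these values cannot tell the two versions apart.
print(prod.run.__module__, prod.run.__qualname__)   # library run
print(dev.run.__module__, dev.run.__qualname__)     # library run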

So to be clear, both functions are used in the same worker? It is not that the worker is simply using the wrong version of the code, right?

So @sangcho we have two paths that look like this:

# Our production code
/production/main.py
/production/library.py

# Our test/development code (which periodically gets pushed to production)
/development/main.py
/development/library.py

We run main.py via ray exec in two different ways depending on whether this is a production job or a development test:

ray exec ray.yaml "cd /production ; python3 main.py"

-- OR --

ray exec ray.yaml "cd /development ; python3 main.py"

In both cases, it basically does this:

# main.py
import ray
import library

library.run()

Library file:

# library.py
import ray

def run():
    ray.init(
        address='auto',
        _redis_password='5241590000000000',
        job_config=ray.job_config.JobConfig(runtime_env={
            "env_vars": {"AUTOSCALER_EVENTS": "0"},
            "conda": "dan-1",
        })
    )
    # <various calls are made to other functions in this library.py file>

# <various other library functions>

All the production jobs and the development jobs are run on the same cluster.

According to the stack trace above from the production run, it was clearly using some functions cached from the production file and other functions cached from the development file.

I suppose we could try to isolate these jobs onto two separate clusters. But that is onerous from an operations perspective, since someone could easily make a mistake and it would be very difficult to notice.

Perhaps this could be handled better by somehow adding filenames to the Ray function cache keys – or even better via file hashes?

I think that when I do a git pull on the production folder to pull in new files it is not guaranteed that the cluster workers will see them (unless someone does a ray down/ray up after each git pull).

Am I understanding all these issues correctly?
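
In the meantime, here is the kind of check I have in mind (just a sketch; it assumes library.py resolves on the workers the same way it does on the driver, via the shared /ceph mount): hash the library file on the driver, pass that hash into a task, and compare it against the hash of the copy the worker actually resolves.

# Sketch only: fail fast if the driver and a worker resolve different copies
# of library.py. This would live inside library.py itself.
import hashlib
import pathlib

import ray

def source_hash():
    # Hash of this file as resolved on the machine running this code.
    return hashlib.sha256(pathlib.Path(__file__).read_bytes()).hexdigest()[:12]

@ray.remote
def check_library_version(driver_hash):
    worker_hash = source_hash()
    if worker_hash != driver_hash:
        raise RuntimeError(
            f"library.py mismatch: driver={driver_hash} worker={worker_hash}")
    return worker_hash

# Called at the start of run(), before launching the real tasks:
#   ray.get(check_library_version.remote(source_hash()))

It is only a best-effort check, but it would make a version mismatch fail loudly and early rather than surfacing later as something unrelated.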

Before I write down some answers, can you tell us which version of Ray you are using?

We are using Ray 1.6.0

Hi @sangcho , any thoughts?

@djakubiec sorry for missing this! I will get back to you soon. Can you ping me one more time if I don’t get back to you by tomorrow?

No worries, thanks @sangcho .

Hey @djakubiec, sorry for being late. This thread somehow slipped through…

This is unexpected behavior (functions from two different jobs getting intermixed on workers), especially since you are on Ray 1.6. Ray doesn't share the same workers across different jobs, so the only way this can happen is if the function descriptor was somehow not calculated properly.

Perhaps this could be handled better by somehow adding filenames to the Ray function cache keys – or even better via file hashes?

I think this could be a solution, but before that we should figure out the exact root cause. My question is: is it possible for you to provide a simple repro script and create a GitHub issue? We can then prioritize fixing it, since it seems to be a big blocker for your usability.

No problem, thanks @sangcho:
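
Here is a minimal, self-contained sketch of a repro along the lines of our setup above (simplified module and function names, hypothetical paths, and it assumes a cluster is already up and reachable from this machine):

# repro_sketch.py -- creates two directories that both contain a module named
# "library" with a remote function of the same name but different behavior,
# then runs main.py from each directory against the same long-lived cluster.
import pathlib
import subprocess
import textwrap

MAIN = textwrap.dedent("""\
    import ray
    import library

    ray.init(address='auto', _redis_password='5241590000000000')
    print("which_version:", ray.get(library.which_version.remote()))
""")

LIBRARY = textwrap.dedent("""\
    import ray

    @ray.remote
    def which_version():
        return {version!r}
""")

for version in ("production", "development"):
    d = pathlib.Path(version)
    d.mkdir(exist_ok=True)
    (d / "main.py").write_text(MAIN)
    (d / "library.py").write_text(LIBRARY.format(version=version))

# Run the two jobs back to back against the same cluster. With the mixing we
# saw above, the second job may end up executing the first job's cached copy
# of which_version().
for version in ("production", "development"):
    subprocess.run(["python3", "main.py"], cwd=version, check=True)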