Ray tasks mixing functions from different files

I got a bizarre stack trace back from one of my Ray tasks that runs functions and classes out of a local Python file we have called “djlearn.py”.

If you look at the last two lines of the stack trace below you can see functions coming from two different versions of the same file:

/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py
/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py

One of these is our production version of the file, the other one is a development version. We do run both kinds of jobs on this same Ray cluster, but we certainly don’t intermingle them in the same ray exec <script> runs.

I am guessing that the cluster has cached the functions from these files at different times, and then somehow mixed them together during this latest script run?

This seems like a Ray bug to me, or maybe we are breaking some rules about how to use Ray? Like perhaps “don’t mix file versions on the same Ray cluster” or something?

Can someone please advise, thank you!

Traceback (most recent call last):
  File "mnesModel4.py", line 771, in <module>
    models.run()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 182, in run
    self.processGroups()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 405, in processGroups
    groupData = ray.get(ready)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::rayProcessGroupTrain() (pid=102573, ip=10.0.1.34)
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 725, in rayProcessGroupTrain
    return processGroupTrain(parameters, sourceFrame, groupData)
  File "/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py", line 758, in processGroupTrain
    log.info(f"Training group {groupData['groupIndex']}: {groupData['testGroups'][0]}, randomize {parameters.randomizeTestFeatures}/{parameters.randomizeAllFeatures}/{parameters.addGoalFeature}")
AttributeError: 'Namespace' object has no attribute 'randomizeAllFeatures'
Shared connection to 10.0.1.34 closed.

I was able to reproduce this stack trace consistently.

I then did a ray down followed by a ray up to restart the cluster and it did indeed cure the issue.

So there does appear to be some kind of incorrect caching/mixing of Python objects going on here.

I think it's related to how Ray pushes functions to workers. If two functions have exactly the same signature/module/path, they are considered the same function within the Ray cluster.

For your case, one workaround is simply to not run these on the same cluster. Alternatively, make sure the two versions do not share the same module name. @sangcho could you also confirm what I said here?
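
To illustrate what I mean (just a sketch with hypothetical paths, not Ray internals): two copies of a module that share the same module name and function name are indistinguishable to anything keyed only on those values, even though their code differs:

# Sketch only: load two different files under the same module name, the way
# "import library" would resolve them from whichever directory the job was
# launched in. Paths here are hypothetical.
import importlib.util
import sys

def load(path, module_name="library"):
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module

prod = load("/production/library.py")
dev = load("/development/library.py")

# Both report the same module and qualified name, so a cache keyed only on
# these values cannot tell the two versions apart.
print(prod.run.__module__, prod.run.__qualname__)   # library run
print(dev.run.__module__, dev.run.__qualname__)     # library run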

So to be clear, both functions are used in the same worker? It is not that the worker is simply using the wrong version of the code, right?

So @sangcho we have two paths that look like this:

# Our production code
/production/main.py
/production/library.py

# Our test/development code (which periodically gets pushed to production)
/development/main.py
/development/library.py

We run main.py via ray exec in two different ways depending on whether this is a production job or a development test:

ray exec ray.yaml "cd /production ; python3 main.py"

-- OR --

ray exec ray.yaml "cd /development ; python3 main.py"

In both cases, it basically does this:

# main.py
import ray
import library

library.run()

Library file:

# library.py
import ray

def run():
    ray.init(
        address='auto',
        _redis_password='5241590000000000',
        job_config=ray.job_config.JobConfig(runtime_env={
            "env_vars": {"AUTOSCALER_EVENTS": "0"},
            "conda": "dan-1",
        })
    )
    # <various calls are made to other functions in this library.py file>

# <various other library functions>

All the production jobs and the development jobs are run on the same cluster.

According to the stack trace above from the production run, it was clearly using some functions cached from the production file and other functions cached from the development file.

I suppose we could try to isolate these jobs onto two separate clusters. But that is onerous from an operations perspective, since someone could easily make a mistake and it would be very difficult to notice.

Perhaps this could be handled better by somehow adding filenames to the Ray function cache keys – or even better via file hashes?

I think that when I do a git pull on the production folder to pull in new files it is not guaranteed that the cluster workers will see them (unless someone does a ray down/ray up after each git pull).

Am I understanding all these issues correctly?
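
In the meantime, here is the kind of check I have in mind (just a sketch; it assumes library.py resolves on the workers the same way it does on the driver, via the shared /ceph mount): hash the library file on the driver, pass that hash into a task, and compare it against the hash of the copy the worker actually resolves.

# Sketch only: fail fast if the driver and a worker resolve different copies
# of library.py. This would live inside library.py itself.
import hashlib
import pathlib

import ray

def source_hash():
    # Hash of this file as resolved on the machine running this code.
    return hashlib.sha256(pathlib.Path(__file__).read_bytes()).hexdigest()[:12]

@ray.remote
def check_library_version(driver_hash):
    worker_hash = source_hash()
    if worker_hash != driver_hash:
        raise RuntimeError(
            f"library.py mismatch: driver={driver_hash} worker={worker_hash}")
    return worker_hash

# Called at the start of run(), before launching the real tasks:
#   ray.get(check_library_version.remote(source_hash()))

It is only a best-effort check, but it would make a version mismatch fail loudly and early rather than surfacing later as something unrelated.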

Before I write down some answers, can you tell us which version of Ray you are using?

We are using Ray 1.6.0

Hi @sangcho , any thoughts?

@djakubiec sorry for missing this! I will get back to you soon. Can you ping me one more time if I don’t get back to you by tomorrow?

No worries, thanks @sangcho .

Hey @djakubiec, sorry for being late. This thread somehow slipped through…

This is unexpected behavior (functions from two different jobs getting intermixed on workers), especially since you are on Ray 1.6. Ray doesn't share the same workers across different jobs, so the only way this can happen is if the function descriptor was somehow not calculated properly.

Perhaps this could be handled better by somehow adding filenames to the Ray function cache keys – or even better via file hashes?

I think this could be a solution, but before that we should figure out the exact root cause. My question is: is it possible for you to provide a simple repro script and create a GitHub issue? We can then prioritize fixing it, since it seems to be a big blocker for your usability.

No problem, thanks @sangcho:
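
Here is a minimal, self-contained sketch of a repro along the lines of our setup above (simplified module and function names, hypothetical paths, and it assumes a cluster is already up and reachable from this machine):

# repro_sketch.py -- creates two directories that both contain a module named
# "library" with a remote function of the same name but different behavior,
# then runs main.py from each directory against the same long-lived cluster.
import pathlib
import subprocess
import textwrap

MAIN = textwrap.dedent("""\
    import ray
    import library

    ray.init(address='auto', _redis_password='5241590000000000')
    print("which_version:", ray.get(library.which_version.remote()))
""")

LIBRARY = textwrap.dedent("""\
    import ray

    @ray.remote
    def which_version():
        return {version!r}
""")

for version in ("production", "development"):
    d = pathlib.Path(version)
    d.mkdir(exist_ok=True)
    (d / "main.py").write_text(MAIN)
    (d / "library.py").write_text(LIBRARY.format(version=version))

# Run the two jobs back to back against the same cluster. With the mixing we
# saw above, the second job may end up executing the first job's cached copy
# of which_version().
for version in ("production", "development"):
    subprocess.run(["python3", "main.py"], cwd=version, check=True)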