I got a bizarre stack trace back from one of my Ray tasks that runs functions and classes out of a local Python file we have called “djlearn.py”.
If you look at the last two frames of the stack trace below, you can see functions coming from two different copies of the same file:
/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py
/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py
One of these is our production version of the file; the other is a development version. We do run both kinds of jobs on the same Ray cluster, but we certainly don’t intermingle them within a single ray exec <script> run.
I am guessing that the cluster has somehow cached the functions from these files at different times and then mixed them during this latest script run.
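One way to check this guess is to log, from inside the task, which file each function was actually loaded from. The following is a minimal sketch using only the standard library; describe_origin is a hypothetical helper name (it is not part of djlearn.py or Ray), and json.dumps merely stands in for a djlearn function.

```python
import json
import os
import sys


def describe_origin(func):
    """Report where a function's code and its module were loaded from.

    Hypothetical helper for illustration only.
    """
    module = sys.modules.get(func.__module__)
    return {
        "pid": os.getpid(),
        "function": func.__qualname__,
        "module": func.__module__,
        # Path recorded in the compiled code object; a traceback prints this path.
        "code_file": func.__code__.co_filename,
        # Path the module object in this process was imported from.
        "module_file": getattr(module, "__file__", None),
    }


# json.dumps is a stand-in for a djlearn function such as processGroupTrain.
info = describe_origin(json.dumps)
print(info)
```

Logging this for both rayProcessGroupTrain and processGroupTrain inside the remote task should show whether a single worker process resolves them to different paths.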
This seems like a Ray bug to me, or maybe we are breaking some rule about how to use Ray, such as “don’t mix file versions on the same Ray cluster”?
Can someone please advise? Thank you!
Traceback (most recent call last):
  File "mnesModel4.py", line 771, in <module>
    models.run()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 182, in run
    self.processGroups()
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 405, in processGroups
    groupData = ray.get(ready)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/envs/dan-1/lib/python3.8/site-packages/ray/worker.py", line 1621, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::rayProcessGroupTrain() (pid=102573, ip=10.0.1.34)
  File "/ceph/var/elcano/main/python.focusvq/focusvq/djakubiec/djlearn.py", line 725, in rayProcessGroupTrain
    return processGroupTrain(parameters, sourceFrame, groupData)
  File "/ceph/home/djakubiec/elcano/python.focusvq/focusvq/djakubiec/djlearn.py", line 758, in processGroupTrain
    log.info(f"Training group {groupData['groupIndex']}: {groupData['testGroups'][0]}, randomize {parameters.randomizeTestFeatures}/{parameters.randomizeAllFeatures}/{parameters.addGoalFeature}")
AttributeError: 'Namespace' object has no attribute 'randomizeAllFeatures'
Shared connection to 10.0.1.34 closed.
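One possible mechanism for this kind of mix, offered as an unconfirmed assumption rather than a diagnosis: a function defined in an importable module is pickled by reference (module name plus function name), so the process that unpickles it re-imports the module from its own sys.path. The sketch below shows that effect with the standard library pickle only. It does not use Ray, and the module name demo_learn and the two temporary directories are hypothetical stand-ins for the production and development copies of djlearn.py.

```python
import importlib
import pickle
import sys
import tempfile
from pathlib import Path

MODULE = "demo_learn"  # hypothetical stand-in for djlearn

with tempfile.TemporaryDirectory() as prod, tempfile.TemporaryDirectory() as dev:
    # Two copies of the "same" file with different contents.
    Path(prod, MODULE + ".py").write_text("def process():\n    return 'prod'\n")
    Path(dev, MODULE + ".py").write_text("def process():\n    return 'dev'\n")

    # "Driver" side: import the production copy and pickle its function.
    sys.path.insert(0, prod)
    driver_module = importlib.import_module(MODULE)
    payload = pickle.dumps(driver_module.process)  # stores only a reference

    # "Worker" side: simulate a process whose sys.path finds the dev copy first.
    sys.path.remove(prod)
    del sys.modules[MODULE]
    sys.path.insert(0, dev)
    importlib.invalidate_caches()

    restored = pickle.loads(payload)  # re-imports demo_learn from dev
    result = restored()
    origin = restored.__code__.co_filename

    # Clean up the simulated environment.
    sys.path.remove(dev)
    del sys.modules[MODULE]

print(result)
print(origin)
```

If something similar happened here, a worker that was started for, or reused from, a development job could resolve djlearn to the /ceph/home/... copy even though the driver script imported the /ceph/var/... copy. Whether Ray workers on this cluster were actually reused across jobs is not something the trace confirms.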