ModuleNotFoundError from the cluster

Hi, I’m new to Ray and trying to parallelize my calculation across a cluster, but some of my remote calls fail with ‘ModuleNotFoundError’ and I can’t work out what is actually happening.

  • Environment:
    ** I have a cluster of 4 nodes, one of which is the head. The head node is started with ‘ray start --head --gcs-server-port=40678 --port=9736’ and the worker nodes are started with ‘ray start --address='xxxx:9736' --redis-password='xxxxx'’.
    ** After starting the head and all workers, I’m able to see them from the dashboard (and I assume the cluster is working fine)
  • In my calculation script, I use ray.init(address='xxxx', _redis_password='5241590000000000') to connect to the cluster and launch about 100 tasks.
  • I run my calculation from the head node, e.g. ‘python test.py’
  • Errors under different scenarios:
    ** In the setup above, with more than one worker node running, tasks scheduled to worker nodes fail with ‘ModuleNotFoundError: No module named 'my-own-package'’.
    ** If I stop all the workers and keep only the head, the calculation finishes and uses all the resources available on the head node.

The error message looks like this:

2021-05-19 14:46:18,715 ERROR worker.py:1056 -- Possible unhandled error from worker: ray::price_cash_flows_batch() (pid=96919, ip=10.23.186.153)
  File "python/ray/_raylet.pyx", line 458, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 479, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 349, in ray._raylet.raise_if_dependency_failed
ray.exceptions.RaySystemError: System error: No module named 'bct'
traceback: Traceback (most recent call last):
  File "/root/anaconda3/envs/risk-engine/lib/python3.8/site-packages/ray/serialization.py", line 246, in deserialize_objects
    obj = self._deserialize_object(data, metadata, object_ref)
  File "/root/anaconda3/envs/risk-engine/lib/python3.8/site-packages/ray/serialization.py", line 188, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/root/anaconda3/envs/risk-engine/lib/python3.8/site-packages/ray/serialization.py", line 166, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/root/anaconda3/envs/risk-engine/lib/python3.8/site-packages/ray/serialization.py", line 156, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
ModuleNotFoundError: No module named 'bct'

It seems to me that the Python path is not set correctly on the remote nodes, but I have no idea what is going wrong or how to fix it, since I’m not using Ray Cluster.

Any ideas?
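
One way to test that suspicion is to run a small check on each node with the same interpreter the Ray workers use. The sketch below is only an illustration: `check_module` is a hypothetical helper, and it uses nothing beyond the standard library. It reports whether the package from the traceback (‘bct’) can be found, and from where.

```python
import importlib.util
import socket
import sys


def check_module(name):
    """Report whether the top-level module `name` is importable by this interpreter."""
    spec = importlib.util.find_spec(name)
    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "found": spec is not None,
        # Path the module would be loaded from, or None if it is missing.
        "origin": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    # 'bct' is the package named in the traceback above.
    print(check_module("bct"))
    print("sys.path:", sys.path)
```

If the head reports the module as found and a worker does not, the worker simply has no copy of the code (or it is not on that worker's sys.path). The same function could presumably be wrapped as a Ray remote task to run the check from the driver, but that variant is not shown here.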

Thanks,
-BS

I think I know what happened: I hadn’t copied the source code to all the remote nodes. After I did that, the ModuleNotFoundError went away.
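
That fits how pickle behaves, as far as I understand it: an object whose class lives in an importable module is serialized as a reference to the module and class name, not as source code, so the receiving process has to be able to import that module itself. The sketch below reproduces the effect locally with the standard library only. The names `demo_pkg` and `Trade` are made up for the demo, and removing the directory from sys.path merely simulates a worker node that lacks the code.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Write a throwaway module to disk ('demo_pkg' and 'Trade' are hypothetical names).
src_dir = tempfile.mkdtemp()
with open(os.path.join(src_dir, "demo_pkg.py"), "w") as f:
    f.write(
        "class Trade:\n"
        "    def __init__(self, notional):\n"
        "        self.notional = notional\n"
    )

# "Driver" side: the module is importable, so pickling succeeds.
sys.path.insert(0, src_dir)
importlib.invalidate_caches()
import demo_pkg

payload = pickle.dumps(demo_pkg.Trade(100))
# The payload carries the module name, not the class definition.
assert b"demo_pkg" in payload

# "Worker" side: simulate a node that does not have the source.
sys.path.remove(src_dir)
del sys.modules["demo_pkg"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)
    error = None
except ModuleNotFoundError as exc:
    error = exc

print(repr(error))
```

The failure happens inside pickle.loads, which matches the last frame of the traceback in the original post.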

Glad you sorted this out!

Hi, how did you copy the source code to all the remotes?

I just copied the whole project to the same absolute path (in my case, under /root) on all remote hosts.
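
For anyone who wants to script that step, here is a rough sketch rather than the exact procedure used above. It assumes rsync is installed and passwordless SSH works from the head node to each worker. The project path `/root/my_project` and the host names `worker1` to `worker3` are placeholders. By default it only prints the commands instead of running them.

```python
import subprocess


def build_rsync_commands(project_dir, hosts, user="root"):
    """Return one rsync command per host, mirroring project_dir to the same absolute path."""
    # Trailing slash: copy the directory's contents into the same directory remotely.
    src = project_dir.rstrip("/") + "/"
    return [["rsync", "-az", src, f"{user}@{host}:{src}"] for host in hosts]


def sync_project(project_dir, hosts, dry_run=True):
    """Print the rsync commands, or run them when dry_run is False."""
    for cmd in build_rsync_commands(project_dir, hosts):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder path and host names; replace with your own.
    sync_project("/root/my_project", ["worker1", "worker2", "worker3"])
```

Keeping the path identical on every node means whatever made the package importable on the head (for example, running from the project directory) has the best chance of working the same way on the workers.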