EOFError: marshal data too short

Can someone please explain what this error means?
Thanks in advance!

2021-03-17 20:16:23,660	ERROR worker.py:1053 -- Possible unhandled error from worker: ray::_Trainable.__init__() (pid=20389, ip=172.20.201.13)
  File "python/ray/_raylet.pyx", line 442, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 459, in ray._raylet.execute_task
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/serialization.py", line 245, in deserialize_objects
    self._deserialize_object(data, metadata, object_ref))
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/serialization.py", line 192, in _deserialize_object
    return self._deserialize_msgpack_data(data, metadata_fields)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/serialization.py", line 170, in _deserialize_msgpack_data
    python_objects = self._deserialize_pickle5_data(pickle5_data)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/serialization.py", line 160, in _deserialize_pickle5_data
    obj = pickle.loads(in_band)
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/tune_sklearn/__init__.py", line 1, in <module>
    from tune_sklearn.tune_gridsearch import TuneGridSearchCV
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/tune_sklearn/tune_gridsearch.py", line 9, in <module>
    from sklearn.model_selection import ParameterGrid
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/model_selection/__init__.py", line 21, in <module>
    from ._validation import cross_val_score
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 30, in <module>
    from ..metrics import check_scoring
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/metrics/__init__.py", line 7, in <module>
    from ._ranking import auc
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 779, in exec_module
  File "<frozen importlib._bootstrap_external>", line 911, in get_code
  File "<frozen importlib._bootstrap_external>", line 580, in _compile_bytecode
EOFError: marshal data too short

I’m not familiar with this error, but from a couple of Google searches it seems like this may be related to deserializing the data in the trainable.

What kind of data are you training on? Are these very large datasets? Can you share a bit more context?

The data that I’m trying to train on is around 700MB. It’s not that large, and these scripts (sklearn classifiers with TuneSearchCV) were working fine before. For some reason, they stopped working entirely yesterday. I’m seeing the same error with 30MB of training data.

Here’s the same error on a smaller dataset:

Traceback (most recent call last):
  File "Tuning/RF.py", line 8, in <module>
    from tune_sklearn import TuneSearchCV
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/tune_sklearn/__init__.py", line 1, in <module>
    from tune_sklearn.tune_gridsearch import TuneGridSearchCV
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/tune_sklearn/tune_gridsearch.py", line 9, in <module>
    from sklearn.model_selection import ParameterGrid
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/model_selection/__init__.py", line 21, in <module>
    from ._validation import cross_val_score
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 30, in <module>
    from ..metrics import check_scoring
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/sklearn/metrics/__init__.py", line 7, in <module>
    from ._ranking import auc
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 779, in exec_module
  File "<frozen importlib._bootstrap_external>", line 911, in get_code
  File "<frozen importlib._bootstrap_external>", line 580, in _compile_bytecode
EOFError: marshal data too short

Also, I’m running the Python scripts on a server using the SLURM scheduler.

I’ve observed that the tuning script works well the first time, but if I run it a second time with a different dataset, it throws this error. Is this a normal/known issue?

Traceback (most recent call last):
  File "/home/tmamidi/.conda/envs/training/bin/ray", line 5, in <module>
    from ray.scripts.scripts import main
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/__init__.py", line 63, in <module>
    import ray._raylet  # noqa: E402
  File "python/ray/_raylet.pyx", line 107, in init ray._raylet
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/exceptions.py", line 6, in <module>
    import ray.cloudpickle as pickle
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/cloudpickle/__init__.py", line 7, in <module>
    from ray.cloudpickle.cloudpickle_fast import CloudPickler, dumps, dump  # noqa
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 487, in <module>
    class CloudPickler(Pickler):
  File "/home/tmamidi/.conda/envs/training/lib/python3.8/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 514, in CloudPickler
    import numpy.core
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 779, in exec_module
  File "<frozen importlib._bootstrap_external>", line 911, in get_code
  File "<frozen importlib._bootstrap_external>", line 580, in _compile_bytecode
EOFError: marshal data too short
srun: error: c0124: task 0: Exited with exit code 1
Traceback (most recent call last):
  File "Tuning/SGD.py", line 2, in <module>
    import numpy as np
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 779, in exec_module
  File "<frozen importlib._bootstrap_external>", line 911, in get_code
  File "<frozen importlib._bootstrap_external>", line 580, in _compile_bytecode
EOFError: marshal data too short
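
A note on reading these tracebacks: the last frames are all inside importlib's `_compile_bytecode`, which is where Python unmarshals a cached `.pyc` file. The error is raised even on a plain `import numpy`, so it looks like the cached bytecode on disk is truncated rather than anything specific to Ray or the training data. Why the files ended up truncated (an interrupted write, a full disk quota, something else) is not something the thread confirms. Below is a minimal sketch for locating such files. The helper names `is_pyc_readable` and `find_broken_pycs` are made up for illustration, the 16-byte header assumes Python 3.7 or newer, and the scan has to run with the same Python version that wrote the cache, since bytecode from another version may also fail to load.

```python
import marshal
from pathlib import Path

# Size of the .pyc header on Python 3.7+ (magic number, flags, and two
# 4-byte fields). Assumption: the cache was written by Python 3.7 or newer.
PYC_HEADER_SIZE = 16


def is_pyc_readable(pyc_path):
    """Return True if the bytecode payload of a .pyc file unmarshals cleanly."""
    data = Path(pyc_path).read_bytes()
    if len(data) <= PYC_HEADER_SIZE:
        # Nothing after the header, so there is no code object to load.
        return False
    try:
        # Same operation importlib performs; only run this on files you trust.
        marshal.loads(data[PYC_HEADER_SIZE:])
    except (EOFError, ValueError, TypeError):
        return False
    return True


def find_broken_pycs(root):
    """Yield every .pyc file under root whose payload fails to unmarshal."""
    for pyc_path in Path(root).rglob("*.pyc"):
        if not is_pyc_readable(pyc_path):
            yield pyc_path
```

For example, `list(find_broken_pycs(sysconfig.get_paths()["purelib"]))` (after `import sysconfig`) scans the active environment's site-packages. Deleting the files it reports is generally safe: Python recompiles a missing `.pyc` from the `.py` source on the next import, provided the directory is writable.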

I might have found the solution(s). A couple of things to note:

  1. I think this has something to do with the conda environment. I’m using most of the packages from anaconda/conda-forge, but packages like ray are from pip. I don’t think the conda version that I’m using can check or reconcile dependency versions between conda and pip.

So, is it possible for the developers to publish a conda package in addition to the pip one?

  2. This one is a problem on my end. I wasn’t initializing the conda environment when logging into the cluster.

With both of these modifications combined, I no longer get errors like the ones I posted above.
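
For the second point, a quick way to confirm that a job really runs inside the intended conda environment is to print what the interpreter sees at the top of the script. This is only a sketch: `describe_environment` and `running_in_conda_env` are hypothetical helper names, and it relies on the `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` variables that `conda activate` sets, so it will report nothing for environments activated some other way.

```python
import os
import sys


def describe_environment():
    """Collect facts that show which Python installation is actually running."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # Both variables are set by `conda activate`; they are None when no
        # conda environment has been activated in this shell or job.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


def running_in_conda_env(expected_name):
    """Return True if the active conda environment is named expected_name."""
    return os.environ.get("CONDA_DEFAULT_ENV") == expected_name
```

In the setup above the environment appears to be called `training` (going by the paths in the tracebacks), so a SLURM script could start with something like `assert running_in_conda_env("training"), describe_environment()` to fail early with a readable message when the environment was not activated.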