ERROR worker.py:382 -- SystemExit was raised from the worker

Hello,

I upgraded from Ray 1.2.0 to 1.3.0 and am getting the errors below (shortened). I’ve created two virtualenvs using pyenv: the one with 1.2.0 runs fine, and the one with 1.3.0 runs with errors. My gist has all the relevant files to replicate the errors (script, requirements.txt, pip list, full error output). The test script uses RLlib to train A3C against SimpleCorridor.

I am running on MacOS. Python 3.7.8.

(pid=45233) 2021-05-24 11:19:57,768     ERROR worker.py:382 -- SystemExit was raised from the worker
(pid=45233) Traceback (most recent call last):
(pid=45233)   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
(pid=45233)   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
(pid=45233)   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/ray130tf1/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
(pid=45233)     return method(__ray_actor, *args, **kwargs)
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/ray130tf1/lib/python3.7/site-packages/ray/actor.py", line 1001, in __ray_terminate__
(pid=45233)     ray.actor.exit_actor()
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/ray130tf1/lib/python3.7/site-packages/ray/actor.py", line 1077, in exit_actor
(pid=45233)     raise exit
(pid=45233) SystemExit: 0
(pid=45233)
(pid=45233) During handling of the above exception, another exception occurred:
(pid=45233)
(pid=45233) Traceback (most recent call last):
(pid=45233)   File "python/ray/_raylet.pyx", line 599, in ray._raylet.task_execution_handler
(pid=45233)   File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task
(pid=45233)   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
(pid=45233)   File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 167, in format_exc
(pid=45233)     return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 121, in format_exception
(pid=45233)     type(value), value, tb, limit=limit).format(chain=chain))
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 508, in __init__
(pid=45233)     capture_locals=capture_locals)
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 363, in extract
(pid=45233)     f.line
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 285, in line
(pid=45233)     self._line = linecache.getline(self.filename, self.lineno).strip()
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 16, in getline
(pid=45233)     lines = getlines(filename, module_globals)
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 47, in getlines
(pid=45233)     return updatecache(filename, module_globals)
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 95, in updatecache
(pid=45233)     stat = os.stat(fullname)
(pid=45233)   File "/Users/rick.lan/.pyenv/versions/ray130tf1/lib/python3.7/site-packages/ray/worker.py", line 379, in sigterm_handler
(pid=45233)     sys.exit(1)
(pid=45233) SystemExit: 1
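
The two stacked tracebacks above have the shape Python produces when a second exception is raised while the first is still being handled: here, `SystemExit: 1` from `sigterm_handler` arrives while `SystemExit: 0` from `exit_actor` is in flight. The following is a minimal sketch of that mechanism with no Ray involved. The handler is a stand-in that only mirrors the `sys.exit(1)` visible in the traceback, and it is called by hand rather than by a real signal; both are assumptions made to keep the example deterministic.

```python
import signal
import sys
import traceback


def sigterm_handler(signum, frame):
    # Stand-in for the handler seen in the traceback: it exits with status 1.
    sys.exit(1)


def exit_during_exit():
    """Raise SystemExit(0), then simulate SIGTERM arriving while it is handled."""
    try:
        try:
            raise SystemExit(0)  # first exit, like exit_actor() in the traceback
        except SystemExit:
            # Hypothetical: call the handler directly instead of sending a signal.
            sigterm_handler(signal.SIGTERM, None)
    except SystemExit as exc:
        return exc


exc = exit_during_exit()
report = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
print(report)
```

The printed report contains both exits joined by the "During handling of the above exception, another exception occurred" line, and the first exit is reachable as `exc.__context__`.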

Gist

Can you show me the log messages from this file in /tmp/ray/session_latest/logs after the crash occurs?

# For your example, find a log file that has pid == 45233
python-core-worker-[worker_id]_[pid].log

(btw, here’s more information about ray logging; Logging — Ray v2.0.0.dev0)
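
A small sketch for locating those files, assuming the directory and the `python-core-worker-[worker_id]_[pid].log` naming pattern quoted above. The helper name `find_worker_logs` is made up for illustration and is not part of Ray.

```python
from pathlib import Path


def find_worker_logs(pid, log_dir="/tmp/ray/session_latest/logs"):
    """Return core-worker log files whose name ends with _<pid>.log."""
    pattern = f"python-core-worker-*_{pid}.log"
    return sorted(Path(log_dir).glob(pattern))


# Prints nothing if the directory is missing or no file matches.
for path in find_worker_logs(45233):
    print(path)
```

Calling it once per pid seen in the error output gives the set of files worth attaching.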

Attached to the gist. There are 3 PIDs in the error output: 45231, 45233 and 45235. The log files are huge, so the gist displays the content of only one log file. Please use search to find the others, e.g. “.log”.

Edit: thank you for sharing about logging. Good to know what’s under the hood.

This is the error message:

[2021-05-24 11:19:57,768 C 45231 22642740] core_worker.cc:723:  Check failed: _s.ok() Bad status: IOError: Broken pipe

Here, broken pipe means the raylet has crashed. Do you see dead nodes when you look at the dashboard by clicking the button below?

If so, can you also give me the raylet.out log from that machine?
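
For searching large log files, a line like the one quoted above can be split into fields with a regular expression. This is only a sketch fitted to that single sample line: the field names are guesses, the thread does not say what the single-letter level or the second number means, and other log lines may not follow the same layout.

```python
import re

# Pattern fitted to the one sample line in this thread; field names are assumptions.
LOG_LINE = re.compile(
    r"\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]) (?P<pid>\d+) (?P<other_id>\d+)\] "
    r"(?P<source>[\w.]+):(?P<line>\d+):\s+(?P<message>.*)"
)


def parse_log_line(text):
    """Return the fields of a log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(text)
    return match.groupdict() if match else None


sample = (
    "[2021-05-24 11:19:57,768 C 45231 22642740] core_worker.cc:723:  "
    "Check failed: _s.ok() Bad status: IOError: Broken pipe"
)
fields = parse_log_line(sample)
print(fields["pid"], fields["source"], fields["message"])
```

Filtering parsed lines on `message.startswith("Check failed")` is one way to pull the failing checks out of a long file.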

The test script’s runtime was very short, so I didn’t open the dashboard.

Added to gist. Please search for raylet.out.

Are you running this on a cluster?

No, I don’t think so. I just run the script:

python crash.py

I cannot access crash.py in your gist. Can you check again whether the code is actually there?

There must be a size limit on gists. crash.py, among other files, was removed, but the revision history seems to record everything. Here it is:

import ray
from ray import tune
from ray.rllib.examples.env.simple_corridor import SimpleCorridor

# Environment and its settings for the A3C trainer
config = {
    "env": SimpleCorridor,
    "env_config": {
        "corridor_length": 10,
    },
}

# Stop after five training iterations
stop = {
    "training_iteration": 5,
}

ray.init()

# Train
results = tune.run(
    "A3C",
    config=config,
    stop=stop,
)

ray.shutdown()

I repeated the experiment with Tensorflow 2 instead of Tensorflow 1 and got error messages similar to before. Again on MacOS with a pyenv virtualenv.

requirements.txt

tensorflow
#numpy==1.18.5
#gym
ray[rllib]
jupyterlab
seaborn
tqdm

(pid=69630) 2021-05-27 08:25:28,711     ERROR worker.py:382 -- SystemExit was raised from the worker
(pid=69630) Traceback (most recent call last):
(pid=69630)   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
(pid=69630)   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
(pid=69630)   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8-ray130-tf2/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
(pid=69630)     return method(__ray_actor, *args, **kwargs)
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8-ray130-tf2/lib/python3.7/site-packages/ray/actor.py", line 1001, in __ray_terminate__
(pid=69630)     ray.actor.exit_actor()
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8-ray130-tf2/lib/python3.7/site-packages/ray/actor.py", line 1077, in exit_actor
(pid=69630)     raise exit
(pid=69630) SystemExit: 0
(pid=69630)
(pid=69630) During handling of the above exception, another exception occurred:
(pid=69630)
(pid=69630) Traceback (most recent call last):
(pid=69630)   File "python/ray/_raylet.pyx", line 599, in ray._raylet.task_execution_handler
(pid=69630)   File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task
(pid=69630)   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
(pid=69630)   File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 167, in format_exc
(pid=69630)     return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 121, in format_exception
(pid=69630)     type(value), value, tb, limit=limit).format(chain=chain))
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 611, in format
(pid=69630)     yield from self.format_exception_only()
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 566, in format_exception_only
(pid=69630)     yield _format_final_exc_line(stype, self._str)
(pid=69630)   File "/Users/rick.lan/.pyenv/versions/3.7.8-ray130-tf2/lib/python3.7/site-packages/ray/worker.py", line 379, in sigterm_handler
(pid=69630)     sys.exit(1)
(pid=69630) SystemExit: 1

I also repeated the original experiment (Ray 1.2.0 vs. 1.3.0, both with Tensorflow 1.15), this time on an Ubuntu 20.04 LTS server. Same behavior: 1.2.0 is fine, but 1.3.0 produces errors:

(pid=1154810) 2021-05-26 23:35:47,328   ERROR worker.py:382 -- SystemExit was raised from the worker
(pid=1154810) Traceback (most recent call last):
(pid=1154810)   File "python/ray/_raylet.pyx", line 495, in ray._raylet.execute_task
(pid=1154810)   File "python/ray/_raylet.pyx", line 505, in ray._raylet.execute_task
(pid=1154810)   File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task.function_executor
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8-ray130-tf1/lib/python3.7/site-packages/ray/_private/function_manager.py", line 556, in actor_method_executor
(pid=1154810)     return method(__ray_actor, *args, **kwargs)
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8-ray130-tf1/lib/python3.7/site-packages/ray/actor.py", line 1001, in __ray_terminate__
(pid=1154810)     ray.actor.exit_actor()
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8-ray130-tf1/lib/python3.7/site-packages/ray/actor.py", line 1077, in exit_actor
(pid=1154810)     raise exit
(pid=1154810) SystemExit: 0
(pid=1154810)
(pid=1154810) During handling of the above exception, another exception occurred:
(pid=1154810)
(pid=1154810) Traceback (most recent call last):
(pid=1154810)   File "python/ray/_raylet.pyx", line 599, in ray._raylet.task_execution_handler
(pid=1154810)   File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task
(pid=1154810)   File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
(pid=1154810)   File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 167, in format_exc
(pid=1154810)     return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 121, in format_exception
(pid=1154810)     type(value), value, tb, limit=limit).format(chain=chain))
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 508, in __init__
(pid=1154810)     capture_locals=capture_locals)
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 363, in extract
(pid=1154810)     f.line
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/traceback.py", line 285, in line
(pid=1154810)     self._line = linecache.getline(self.filename, self.lineno).strip()
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 16, in getline
(pid=1154810)     lines = getlines(filename, module_globals)
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 47, in getlines
(pid=1154810)     return updatecache(filename, module_globals)
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/linecache.py", line 137, in updatecache
(pid=1154810)     lines = fp.readlines()
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8/lib/python3.7/codecs.py", line 319, in decode
(pid=1154810)     def decode(self, input, final=False):
(pid=1154810)   File "/home/rick/.pyenv/versions/3.7.8-ray130-tf1/lib/python3.7/site-packages/ray/worker.py", line 379, in sigterm_handler
(pid=1154810)     sys.exit(1)
(pid=1154810) SystemExit: 1

Trying a different Tensorflow version and a different OS did not work around the problem, so I tried a different Python version: 3.8.10. Tensorflow 1.x is not supported on that version, but Tensorflow 2 is. I ran Ray 1.3.0 on both MacOS and Ubuntu 20.04, and both run without errors.

I ran one more experiment and summarized all experiments:

Mac OS 10.15.7 Ubuntu 20.04 LTS Python 3.7.8 Python 3.8.10 Ray 1.2.0 Ray 1.3.0 Tensorflow 1.15.4 Tensorflow 2.5.0 Errors
x x x x N
x x x x N
x x x x Y
x x x x Y
x x x x Y
x x x x N
x x x x N
x x x x N

It seems that the errors appear with the Python 3.7.8 and Ray 1.3.0 combination.
On GCP, I am using Python 3.7.8 with both Debian 9 and Debian 10 images. I haven’t yet tried Ray 1.3.0 there. I want to move from Ray 1.2 to 1.3, but I worry it will break my VM install. The Debian 10 image is quite new, so this may affect many people.
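
The hypothesis can be checked mechanically against the runs that are spelled out in prose earlier in this thread. The sketch below encodes only those seven runs; the eighth row of the table is not described in the text, so it is left out. The labels "1.x" and "2.x" are shorthand for the Tensorflow versions mentioned, and the function name `predicted` is made up for illustration.

```python
# (os, python, ray, tensorflow, errors) for the runs described in prose above.
RUNS = [
    ("macOS", "3.7.8", "1.2.0", "1.x", False),
    ("macOS", "3.7.8", "1.3.0", "1.x", True),
    ("macOS", "3.7.8", "1.3.0", "2.x", True),
    ("Ubuntu", "3.7.8", "1.2.0", "1.x", False),
    ("Ubuntu", "3.7.8", "1.3.0", "1.x", True),
    ("macOS", "3.8.10", "1.3.0", "2.x", False),
    ("Ubuntu", "3.8.10", "1.3.0", "2.x", False),
]


def predicted(python, ray):
    """Hypothesis: errors appear only with Python 3.7.8 together with Ray 1.3.0."""
    return python == "3.7.8" and ray == "1.3.0"


# Collect every run whose observed outcome disagrees with the hypothesis.
mismatches = [run for run in RUNS if predicted(run[1], run[2]) != run[4]]
print("runs that contradict the hypothesis:", mismatches)
```

None of the seven encoded runs contradicts the hypothesis, though that is not proof: OS and Tensorflow version are not varied independently in every combination.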

FYI @rliaw

So, based on the table, Python 3.7 has some issues?

Would you like to do some sort of pair debugging for this? I think it is a little difficult to debug through chat. Also cc @rliaw: please follow up if you have seen anything similar before.

Before we go to pair debugging, let me ask a stupid question. Are you able to duplicate my errors? If not, what is your environment setup? Perhaps I’ll duplicate that in my environment, as a workaround for now.
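
When comparing setups like this, it helps if everyone reports the same fields. Below is a sketch of a small report script that uses only the standard library. The package list is an assumption based on the libraries mentioned in this thread, and `importlib.metadata` requires Python 3.8 or newer, so on 3.7 the sketch falls back to reporting the versions as unknown.

```python
import platform

try:
    from importlib import metadata  # available on Python 3.8+
except ImportError:  # Python 3.7: would need the importlib_metadata backport
    metadata = None


def package_version(name):
    """Return the installed version of a package, or a short placeholder."""
    if metadata is None:
        return "unknown (importlib.metadata unavailable)"
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def describe_environment(packages=("ray", "tensorflow", "torch")):
    """Collect OS, Python, and package versions into one dict."""
    info = {"os": platform.platform(), "python": platform.python_version()}
    for name in packages:
        info[name] = package_version(name)
    return info


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Pasting this output next to a traceback makes it easier to line reports up against the experiment table.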

Not much to add here, but just wanted to note that I’ve been getting this error ever since we migrated to 1.3.0. It’s pretty harmless in our case – showing up only after the Tune experiment ends – and doesn’t seem to affect anything. Will add more details if things change.

Env: Python 3.7.5, Ray 1.3.0 – for an application that uses Tune with PyTorch and Tensorflow.

@Vishnu That’s good to know! Thank you for sharing.

cc @sven1977 Please comment if you know of any similar issue from RLlib!

I am running on my local macOS Catalina 10.15.7 with a “conda” environment, and the code you posted seems to work.

@Vishnu Where is your Python environment running? Are you using conda, pyenv, or something else?