Hi all, this is a simple test using the RaySGD wrapper, with exactly the same code listed on the "RaySGD: Distributed Training Wrappers" page of the Ray v1.6.0 docs. Training appears to finish, since “success!” is printed in the output. However, every run also prints errors raised from Ray’s internal workers. We tried several types of machines and they all give the same error. BTW, with the number of workers set to 1 there is no error; as soon as it is larger than 1, the error appears. Could you help debug this?
Versions: Ray 1.6.0, torch 1.9.0, torchvision 0.10.0.
2021-10-02 00:18:41,150 INFO services.py:1263 – View the Ray dashboard at http://127.0.0.1:8265
(pid=3716485) 2021-10-02 00:18:44,170 INFO distributed_torch_runner.py:58 – Setting up process group for: tcp://10.248.159.106:12115 [rank=1]
(pid=3716488) 2021-10-02 00:18:44,109 INFO distributed_torch_runner.py:58 – Setting up process group for: tcp://10.248.159.106:12115 [rank=0]
{‘num_samples’: 1000, ‘epoch’: 1.0, ‘batch_count’: 8.0, ‘train_loss’: 36.19136633300781, ‘last_train_loss’: 3.6160049438476562}
success!
(pid=3716485) 2021-10-02 00:18:44,260 ERROR worker.py:428 – SystemExit was raised from the worker
(pid=3716485) Traceback (most recent call last):
(pid=3716485) File “python/ray/_raylet.pyx”, line 640, in ray._raylet.task_execution_handler
(pid=3716485) File “python/ray/_raylet.pyx”, line 488, in ray._raylet.execute_task
(pid=3716485) File “python/ray/_raylet.pyx”, line 525, in ray._raylet.execute_task
(pid=3716485) File “python/ray/_raylet.pyx”, line 532, in ray._raylet.execute_task
(pid=3716485) File “python/ray/_raylet.pyx”, line 536, in ray._raylet.execute_task
(pid=3716485) File “python/ray/_raylet.pyx”, line 486, in ray._raylet.execute_task.function_executor
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py”, line 563, in actor_method_executor
(pid=3716485) return method(__ray_actor, *args, **kwargs)
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py”, line 1047, in ray_terminate
(pid=3716485) ray.actor.exit_actor()
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py”, line 1109, in exit_actor
(pid=3716485) ray.worker.disconnect()
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py”, line 1491, in disconnect
(pid=3716485) worker.import_thread.join_import_thread()
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/import_thread.py”, line 50, in join_import_thread
(pid=3716485) self.t.join()
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/threading.py”, line 1011, in join
(pid=3716485) self._wait_for_tstate_lock()
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/threading.py”, line 1027, in _wait_for_tstate_lock
(pid=3716485) elif lock.acquire(block, timeout):
(pid=3716485) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py”, line 425, in sigterm_handler
(pid=3716485) sys.exit(1)
(pid=3716485) SystemExit: 1
(pid=3716488) 2021-10-02 00:18:44,257 ERROR worker.py:428 – SystemExit was raised from the worker
(pid=3716488) Traceback (most recent call last):
(pid=3716488) File “python/ray/_raylet.pyx”, line 532, in ray._raylet.execute_task
(pid=3716488) File “python/ray/_raylet.pyx”, line 536, in ray._raylet.execute_task
(pid=3716488) File “python/ray/_raylet.pyx”, line 486, in ray._raylet.execute_task.function_executor
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py”, line 563, in actor_method_executor
(pid=3716488) return method(__ray_actor, *args, **kwargs)
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py”, line 1047, in ray_terminate
(pid=3716488) ray.actor.exit_actor()
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py”, line 1123, in exit_actor
(pid=3716488) raise exit
(pid=3716488) SystemExit: 0
(pid=3716488)
(pid=3716488) During handling of the above exception, another exception occurred:
(pid=3716488)
(pid=3716488) Traceback (most recent call last):
(pid=3716488) File “python/ray/_raylet.pyx”, line 640, in ray._raylet.task_execution_handler
(pid=3716488) File “python/ray/_raylet.pyx”, line 488, in ray._raylet.execute_task
(pid=3716488) File “python/ray/_raylet.pyx”, line 525, in ray._raylet.execute_task
(pid=3716488) File “python/ray/includes/libcoreworker.pxi”, line 33, in ray._raylet.ProfileEvent.__exit__
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/traceback.py”, line 167, in format_exc
(pid=3716488) return “”.join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/traceback.py”, line 120, in format_exception
(pid=3716488) return list(TracebackException(
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/traceback.py”, line 508, in __init__
(pid=3716488) self.stack = StackSummary.extract(
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/traceback.py”, line 366, in extract
(pid=3716488) f.line
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/traceback.py”, line 288, in line
(pid=3716488) self._line = linecache.getline(self.filename, self.lineno).strip()
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/linecache.py”, line 16, in getline
(pid=3716488) lines = getlines(filename, module_globals)
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/linecache.py”, line 47, in getlines
(pid=3716488) return updatecache(filename, module_globals)
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/linecache.py”, line 136, in updatecache
(pid=3716488) with tokenize.open(fullname) as fp:
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/tokenize.py”, line 392, in open
(pid=3716488) buffer = _builtin_open(filename, ‘rb’)
(pid=3716488) File “/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py”, line 425, in sigterm_handler
(pid=3716488) sys.exit(1)
(pid=3716488) SystemExit: 1
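One thing both tracebacks have in common: they end in `sigterm_handler` in `ray/worker.py` calling `sys.exit(1)`, and for pid 3716485 this happens while the main thread is blocked in `self.t.join()`. To make that pattern easier to read, here is a minimal stdlib-only sketch of it. It is not Ray code and not a confirmed diagnosis. It assumes a POSIX system, and the names `demo` and `worker` are hypothetical stand-ins. It only shows that a SIGTERM handler calling `sys.exit(1)` surfaces as `SystemExit: 1` inside whatever the main thread was doing at that moment, here a thread join.

```python
import os
import signal
import sys
import threading
import time


def sigterm_handler(signum, frame):
    # Same shape as the handler in the traceback: SIGTERM becomes SystemExit(1).
    sys.exit(1)


def demo():
    """Return the SystemExit code raised while the main thread waits on a join."""
    previous = signal.signal(signal.SIGTERM, sigterm_handler)

    # Hypothetical stand-in for a background thread that is slow to finish.
    worker = threading.Thread(target=time.sleep, args=(1.0,), daemon=True)
    worker.start()

    # Send SIGTERM to this process shortly after the main thread starts waiting.
    timer = threading.Timer(0.2, os.kill, args=(os.getpid(), signal.SIGTERM))
    timer.start()

    try:
        # Wait in short slices so the handler runs promptly in the main thread.
        while worker.is_alive():
            worker.join(timeout=0.05)
    except SystemExit as exc:
        # The exception comes from the signal handler, not from join() itself.
        return exc.code
    finally:
        timer.cancel()
        signal.signal(signal.SIGTERM, previous)  # restore the original handler
    return None


if __name__ == "__main__":
    print("exit code raised during join:", demo())
```

If the same thing is happening here, the `SystemExit: 1` lines would mean the workers received SIGTERM while they were already shutting down, rather than that training itself failed. That is only my reading of the traceback.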