Errors when testing TorchTrainer with the "getting started" code

Hi all, this is a simple test of the RaySGD wrapper, using exactly the code listed on the "RaySGD: Distributed Training Wrappers" page of the Ray v1.6.0 docs. The training itself seems to finish, since "success!" appears in the output. However, every run then prints errors thrown from Ray's internal workers. We tried it on several types of machines and they all give the same error. With 1 worker there is no error; as soon as the number of workers is larger than 1, the error is thrown. Could you help debug this?

Ray version is 1.6.0. The torch version is 1.9.0, torchvision is 0.10.0.

2021-10-02 00:18:41,150 INFO services.py:1263 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=3716485) 2021-10-02 00:18:44,170 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.248.159.106:12115 [rank=1]
(pid=3716488) 2021-10-02 00:18:44,109 INFO distributed_torch_runner.py:58 -- Setting up process group for: tcp://10.248.159.106:12115 [rank=0]
{'num_samples': 1000, 'epoch': 1.0, 'batch_count': 8.0, 'train_loss': 36.19136633300781, 'last_train_loss': 3.6160049438476562}
success!
(pid=3716485) 2021-10-02 00:18:44,260 ERROR worker.py:428 -- SystemExit was raised from the worker
(pid=3716485) Traceback (most recent call last):
(pid=3716485) File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler
(pid=3716485) File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
(pid=3716485) File "python/ray/_raylet.pyx", line 525, in ray._raylet.execute_task
(pid=3716485) File "python/ray/_raylet.pyx", line 532, in ray._raylet.execute_task
(pid=3716485) File "python/ray/_raylet.pyx", line 536, in ray._raylet.execute_task
(pid=3716485) File "python/ray/_raylet.pyx", line 486, in ray._raylet.execute_task.function_executor
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
(pid=3716485) return method(__ray_actor, *args, **kwargs)
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py", line 1047, in ray_terminate
(pid=3716485) ray.actor.exit_actor()
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py", line 1109, in exit_actor
(pid=3716485) ray.worker.disconnect()
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 1491, in disconnect
(pid=3716485) worker.import_thread.join_import_thread()
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/import_thread.py", line 50, in join_import_thread
(pid=3716485) self.t.join()
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/threading.py", line 1011, in join
(pid=3716485) self._wait_for_tstate_lock()
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
(pid=3716485) elif lock.acquire(block, timeout):
(pid=3716485) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 425, in sigterm_handler
(pid=3716485) sys.exit(1)
(pid=3716485) SystemExit: 1
(pid=3716488) 2021-10-02 00:18:44,257 ERROR worker.py:428 -- SystemExit was raised from the worker
(pid=3716488) Traceback (most recent call last):
(pid=3716488) File "python/ray/_raylet.pyx", line 532, in ray._raylet.execute_task
(pid=3716488) File "python/ray/_raylet.pyx", line 536, in ray._raylet.execute_task
(pid=3716488) File "python/ray/_raylet.pyx", line 486, in ray._raylet.execute_task.function_executor
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
(pid=3716488) return method(__ray_actor, *args, **kwargs)
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py", line 1047, in ray_terminate
(pid=3716488) ray.actor.exit_actor()
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/actor.py", line 1123, in exit_actor
(pid=3716488) raise exit
(pid=3716488) SystemExit: 0
(pid=3716488)
(pid=3716488) During handling of the above exception, another exception occurred:
(pid=3716488)
(pid=3716488) Traceback (most recent call last):
(pid=3716488) File "python/ray/_raylet.pyx", line 640, in ray._raylet.task_execution_handler
(pid=3716488) File "python/ray/_raylet.pyx", line 488, in ray._raylet.execute_task
(pid=3716488) File "python/ray/_raylet.pyx", line 525, in ray._raylet.execute_task
(pid=3716488) File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/traceback.py", line 167, in format_exc
(pid=3716488) return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/traceback.py", line 120, in format_exception
(pid=3716488) return list(TracebackException(
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/traceback.py", line 508, in __init__
(pid=3716488) self.stack = StackSummary.extract(
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/traceback.py", line 366, in extract
(pid=3716488) f.line
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/traceback.py", line 288, in line
(pid=3716488) self._line = linecache.getline(self.filename, self.lineno).strip()
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/linecache.py", line 16, in getline
(pid=3716488) lines = getlines(filename, module_globals)
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/linecache.py", line 47, in getlines
(pid=3716488) return updatecache(filename, module_globals)
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/linecache.py", line 136, in updatecache
(pid=3716488) with tokenize.open(fullname) as fp:
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/tokenize.py", line 392, in open
(pid=3716488) buffer = _builtin_open(filename, 'rb')
(pid=3716488) File "/home/xin.chen/anaconda3/lib/python3.8/site-packages/ray/worker.py", line 425, in sigterm_handler
(pid=3716488) sys.exit(1)
(pid=3716488) SystemExit: 1

Hey @Xin_Chen,

Thanks for reaching out! We're actually in the process of revamping Ray SGD: Ray SGD v2 will be released in alpha as part of Ray 1.7.0, which should be out within the next week. If you're interested, you can see the documentation on the "RaySGD: Deep Learning on Ray" page of the Ray v2.0.0.dev0 docs. To try it out before Ray 1.7.0 is released, you can install a nightly wheel following these instructions!
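
As for the traceback itself: both workers end in `sigterm_handler` calling `sys.exit(1)`, and the first one was inside `self.t.join()` at that moment. That pattern suggests the workers received SIGTERM while they were already shutting down, after training had finished. The snippet below is a stdlib-only sketch of that mechanism, not Ray code. `run_demo`, the background thread and the timer are hypothetical names made up for illustration, and only the handler body mirrors what the traceback shows. It assumes a POSIX system and that it is called from the main thread.

```python
import signal
import sys
import threading


def sigterm_handler(signum, frame):
    # Same shape as the handler in the traceback: turn SIGTERM into SystemExit(1).
    sys.exit(1)


def run_demo():
    """Return the SystemExit code raised while the main thread waits in join()."""
    previous = signal.signal(signal.SIGTERM, sigterm_handler)

    # A background thread that stays alive until we release it.
    stop = threading.Event()
    background = threading.Thread(target=stop.wait, daemon=True)
    background.start()

    # Send SIGTERM to the main thread shortly after join() starts waiting.
    main_ident = threading.main_thread().ident
    sender = threading.Timer(
        0.2, signal.pthread_kill, args=(main_ident, signal.SIGTERM)
    )
    sender.start()

    try:
        # The handler runs in the main thread, so SystemExit surfaces here,
        # from inside join(). The timeout is only a safety net for the demo.
        background.join(timeout=5)
        return None
    except SystemExit as exc:
        return exc.code
    finally:
        stop.set()
        sender.cancel()
        signal.signal(signal.SIGTERM, previous)


if __name__ == "__main__":
    print("SystemExit code seen in join():", run_demo())
```

If that reading is right, the errors are shutdown noise printed after the training result and `success!`, but that is an inference from the log rather than something confirmed in this thread.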