Ray crashes on Slurm

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I am trying to get Ray to run on our Slurm cluster but am getting frequent crashes, most likely because I sometimes need to run multiple worker jobs (i.e. separate invocations of ray start) on the same physical machine. This is due to the way jobs are scheduled on the cluster, and I cannot change it. I have followed the discussion in [core] [help] Running `ray start` on the same node in parallel would get port error · Issue #10154 · ray-project/ray · GitHub, and I am setting unique node manager, object manager, and min and max worker ports for each ray start command on the workers. Nevertheless, I still get crashes.

There seem to be two related issues.

  1. Two workers on the same machine get a SIGABRT due to a port already in use, and one of the workers will exit. Oddly there are no error messages in the worker job, but I do get an error on the head node, see below [1].

  2. Sometimes this will even lead to my tune.run() crashing, bringing the entire job down. See [2] below.

Anything I’m doing wrong here?

My head and worker start commands look like this:
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-cpus=20 --block -v
and
ray start --address $1 --redis-password=$2 --num-cpus=5 --block --node-manager-port 16000 --object-manager-port 16001 --min-worker-port 16002 --max-worker-port 16099 -v
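
For illustration, here is a minimal sketch of one way to derive a separate port block for each ray start invocation on a machine, so that no two instances share a node manager, object manager, or worker port. The names BASE_PORT, PORTS_PER_INSTANCE, PortPlan, plan_ports and worker_command are hypothetical, and the slot index (0, 1, 2, ... per instance on the same machine) is an assumption: how to obtain a value that is unique per machine is left open. Slot 0 reproduces the ports from the worker command above.

```python
from dataclasses import dataclass

BASE_PORT = 16000         # assumption: start of a range that is free on the nodes
PORTS_PER_INSTANCE = 100  # assumption: size of the block reserved per `ray start`


@dataclass(frozen=True)
class PortPlan:
    node_manager: int
    object_manager: int
    min_worker: int
    max_worker: int


def plan_ports(slot: int) -> PortPlan:
    """Return a port block for the slot-th Ray instance on one machine."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    start = BASE_PORT + slot * PORTS_PER_INSTANCE
    end = start + PORTS_PER_INSTANCE - 1
    if end > 65535:
        raise ValueError("port block exceeds the valid port range")
    # First two ports go to the managers, the rest form the worker range.
    return PortPlan(start, start + 1, start + 2, end)


def worker_command(head_address: str, redis_password: str, slot: int,
                   num_cpus: int = 5) -> list:
    """Build the argument list for a worker `ray start` using the slot's ports."""
    ports = plan_ports(slot)
    return [
        "ray", "start",
        "--address", head_address,
        f"--redis-password={redis_password}",
        f"--num-cpus={num_cpus}",
        "--block",
        "--node-manager-port", str(ports.node_manager),
        "--object-manager-port", str(ports.object_manager),
        "--min-worker-port", str(ports.min_worker),
        "--max-worker-port", str(ports.max_worker),
    ]


if __name__ == "__main__":
    # Print the command line for the second instance on a machine.
    print(" ".join(worker_command("10.0.0.1:6379", "secret", slot=1)))
```

The resulting list can be passed to subprocess.run or joined into a shell command. Blocks for different slots never overlap, but this alone does not guarantee that the ports are unused by other processes on the machine.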

[1] Log from when worker crashes

(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,497 E 24640 24701] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,515 E 24640 24640] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,530 E 24640 24701] (raylet) logging.cc:104: Stack trace: 
(raylet, ip=10.31.133.83)  .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x5636a491d3ea] ray::operator<<()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x5636a491fbb8] ray::TerminateHandler()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2b0c8f3c9f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2b0c8f3c9f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2b0c8f3ca15a] __cxa_rethrow
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x5636a45d99c8] boost::throw_exception<>()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x5636a4e2dfb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x5636a465a74f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x5636a4793d2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x5636a478ce68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x5636a472a445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x5636a4e7c6e0] execute_native_thread_routine
(raylet, ip=10.31.133.83) /lib64/libpthread.so.0(+0x7ea5) [0x2b0c8f508ea5] start_thread
(raylet, ip=10.31.133.83) /lib64/libc.so.6(clone+0x6d) [0x2b0c8fd259fd] clone
(raylet, ip=10.31.133.83) 
(raylet, ip=10.31.133.83) *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) PC: @     0x2b0c8fc5d387  (unknown)  raise
(raylet, ip=10.31.133.83)     @     0x2b0c8f510630       1872  (unknown)
(raylet, ip=10.31.133.83)     @     0x2b0c8f3c9f47  379532608  __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83)     @     0x2b0c8f3ca095  (unknown)  __cxa_tm_cleanup
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: PC: @     0x2b0c8fc5d387  (unknown)  raise
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361:     @     0x2b0c8f510630       1872  (unknown)
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361:     @     0x2b0c8f3c9f47  379532608  __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361:     @     0x2b0c8f3ca095  (unknown)  __cxa_tm_cleanup
(raylet, ip=10.31.133.83) E1019 10:05:41.451662854   24724 server_chttp2.cc:48]        {"created":"@1666188341.451599595","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188341.451589337","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188341.451577393","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451573528","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188341.451588645","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451586177","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(my_trainable pid=94802) 2022-10-19 10:05:44,511     WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.algorithms.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.algorithms.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,542     WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,544     INFO simple_q.py:293 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(my_trainable pid=94802) 2022-10-19 10:05:44,545     INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(my_trainable pid=94802) 2022-10-19 10:05:44,968     WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,968     WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:45,182     WARNING util.py:65 -- Install gputil for GPU system monitoring.
(my_trainable pid=94802) 2022-10-19 10:05:45,358     WARNING multi_agent_prioritized_replay_buffer.py:220 -- Adding batches with column `weights` to this buffer while providing weights as a call argument to the add method results in the column being overwritten.
(my_trainable pid=94802) 2022-10-19 10:05:45,577     WARNING deprecation.py:47 -- DeprecationWarning: `concat_samples` has been deprecated. Use `concat_samples() from rllib.policy.sample_batch` instead. This will raise an error in the future!
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,972 E 6499 6581] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,987 E 6499 6499] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,004 E 6499 6581] (raylet) logging.cc:104: Stack trace: 
(raylet, ip=10.31.133.85)  .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x555abae573ea] ray::operator<<()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x555abae59bb8] ray::TerminateHandler()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2ada50389f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2ada50389f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2ada5038a15a] __cxa_rethrow
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x555abab139c8] boost::throw_exception<>()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x555abb367fb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x555abab9474f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x555abaccdd2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x555abacc6e68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x555abac64445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x555abb3b66e0] execute_native_thread_routine
(raylet, ip=10.31.133.85) /lib64/libpthread.so.0(+0x7ea5) [0x2ada504c8ea5] start_thread
(raylet, ip=10.31.133.85) /lib64/libc.so.6(clone+0x6d) [0x2ada50ce59fd] clone
(raylet, ip=10.31.133.85) 
(raylet, ip=10.31.133.85) *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) PC: @     0x2ada50c1d387  (unknown)  raise
(raylet, ip=10.31.133.85)     @     0x2ada504d0630       1872  (unknown)
(raylet, ip=10.31.133.85)     @     0x2ada50389f47  362952000  __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85)     @     0x2ada5038a095  (unknown)  __cxa_tm_cleanup
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: PC: @     0x2ada50c1d387  (unknown)  raise
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361:     @     0x2ada504d0630       1872  (unknown)
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361:     @     0x2ada50389f47  362952000  __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361:     @     0x2ada5038a095  (unknown)  __cxa_tm_cleanup
(raylet, ip=10.31.133.85) E1019 10:05:49.881884938    6572 server_chttp2.cc:48]        {"created":"@1666188349.881824741","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188349.881814569","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188349.881801585","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881797781","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188349.881813655","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881810971","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}

[2] tune.run() crashing:

Traceback (most recent call last):
  File ".../main.py", line 499, in <module>
    main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
  File ".../main.py", line 475, in main
    tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 427, in run
    return ray.get(remote_future)
  File "..../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
ray.exceptions.RayTaskError: ray::run() (pid=88223, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 724, in run
    _report_progress(runner, progress_reporter)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 125, in _report_progress
    reporter.report(trials, done, sched_debug_str, executor_debug_str)
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 641, in report
    print(self._progress_str(trials, done, *sys_info))
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 347, in _progress_str
    user_metrics = self._infer_user_metrics(trials, self._infer_limit)
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 396, in _infer_user_metrics
    if not t.last_result:
  File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 445, in last_result
    self._get_default_result_or_future()
  File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 420, in _get_default_result_or_future
    self._default_result_or_future = ray.get(self._default_result_or_future)
ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to create runtime environment {"env_vars": {"TUNE_ORIG_WORKING_DIR": "..."}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.

Update: It seems this is resolved for me if I leave plenty of time between starting worker jobs - 60 seconds is enough for me. I suspect I need a larger interval in between because we have a fair bit of setup to do for each job, which can take a little longer or shorter each time, so we need some extra buffer to ensure no two jobs ever start at the same time.
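
Instead of relying on a fixed gap between job submissions, the same effect could probably be had by serializing the startup phase on each machine with a lock file. This is only a sketch under assumptions: the helper names startup_lock and start_serialized are hypothetical, the lock path must live on a node-local filesystem such as /tmp, fcntl is only available on Unix, and the 60-second settle time is simply the value that worked above.

```python
import fcntl
import time
from contextlib import contextmanager


@contextmanager
def startup_lock(path="/tmp/ray_start.lock"):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "a") as handle:
        # Blocks until no other process on this machine holds the lock.
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def start_serialized(launch, settle_seconds=60.0, lock_path="/tmp/ray_start.lock"):
    """Call `launch()` while holding the lock, then wait before releasing it.

    `launch` must return promptly (for example by spawning `ray start`
    in the background); the settle time keeps the next instance on the
    same machine from starting until this one has had time to come up.
    """
    with startup_lock(lock_path):
        result = launch()
        time.sleep(settle_seconds)
    return result
```

Here launch could be something like lambda: subprocess.Popen([...]) wrapping the ray start command. Since ray start --block never returns, it has to be spawned in the background rather than waited on while the lock is held.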

This is a shot in the dark, but you could try hardcoding some other ports listed here: Ray Core API — Ray 2.0.1

Another guess would be to pass min-worker-port and max-worker-port to the ray start --head command and make sure that interval is disjoint from the one for the worker node command. (Right now it looks like you’re only passing it to the worker node)

Should all the ports listed there be distinct for each head/worker instance, or are there any that should be the same for workers and head? I wasn’t sure which of them referred to ports the client will open locally (i.e. should be unique), and which refer to ports on the server the client will try to connect to (i.e. should be the same everywhere).

Also I assume that bad things will happen if by chance I pick a port for any of these that’s already in use, correct?

Ah, I found a better docs page here: Configuring Ray — Ray 2.0.1 Does this clarify things?

I think so - just to be sure: I should set all the options listed under “All Nodes” on all instances (head and worker), and they should be unique for each of them. And I should set the options listed under “Head node” on all instances too, but they should be identical everywhere because they all point to the head node (and obviously they shouldn’t overlap with any of the “All Nodes” ports).

I assume if by chance a port I hardcode that way is already in use, then that instance will fail to start?

Yeah, I’m not too confident about it, but that matches my understanding. It would be great to understand which port is actually causing the conflict, but I’m not sure of the best way to debug it. My guess is that because it only happens some time after you run tune.run(), it might have to do with workers starting during the run (and not with long-running processes that start when ray start is called, like the dashboard agent and various managers). That’s where the worker-port guess came from.
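
One way to narrow down a conflict might be to probe the hard-coded ports right before calling ray start and report any that are already taken. The sketch below uses only the Python standard library; port_is_free and ports_in_use are hypothetical helper names. It only checks TCP ports at the moment of the call, so a port can still be grabbed by another process between the check and the actual start.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use" (errno 98 on Linux).
            return False
    return True


def ports_in_use(ports, host="0.0.0.0"):
    """Return the subset of `ports` that cannot be bound right now."""
    return [port for port in ports if not port_is_free(port, host)]


if __name__ == "__main__":
    # Ports taken from the worker command earlier in this thread.
    candidates = [16000, 16001] + list(range(16002, 16100))
    busy = ports_in_use(candidates)
    print("ports already in use:", busy)
```

Running this on the machine just before each ray start, and logging the result, would at least show whether one of the explicitly configured ports was occupied at that point.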