How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to get Ray to run on our Slurm cluster, but I am getting frequent crashes. The most likely cause, I think, is that I sometimes need to run multiple worker jobs (i.e. separate invocations of ray start) on the same physical machine. This is due to the way jobs are scheduled on the cluster, and I cannot change it. I have followed the discussion here: [core] [help] Running `ray start` on the same node in parallel would get port error · Issue #10154 · ray-project/ray · GitHub and I am setting unique node manager, object manager, and min and max worker ports for each ray start command on the workers. Nevertheless, I still get crashes.
There seem to be two related issues.
- Two workers on the same machine get a SIGABRT due to a port already being in use, and one of the workers exits. Oddly, there are no error messages in the worker job, but I do get an error on the head node; see [1] below.
- Sometimes this even leads to my tune.run() crashing, bringing the entire job down. See [2] below.
Anything I’m doing wrong here?
My head and worker start commands look like this:
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-cpus=20 --block -v
and
ray start --address $1 --redis-password=$2 --num-cpus=5 --block --node-manager-port 16000 --object-manager-port 16001 --min-worker-port 16002 --max-worker-port 16099 -v
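The per-worker port layout described above could be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual launch script: `slot` is a hypothetical index that would have to be unique among the workers sharing one physical machine (how to obtain it from Slurm is left open), and `BASE_PORT`, `PORTS_PER_WORKER`, `port_plan` and `ray_start_args` are made-up names. It only uses `ray start` flags that appear in this post, and leaves out `--redis-password` and `--num-cpus` for brevity.

```python
# Illustrative values only; these are not Ray defaults.
BASE_PORT = 16000
PORTS_PER_WORKER = 100


def port_plan(slot: int, base: int = BASE_PORT, width: int = PORTS_PER_WORKER) -> dict:
    """Return a block of ports that does not overlap with any other slot's block."""
    if slot < 0:
        raise ValueError("slot must be non-negative")
    start = base + slot * width
    if start + width - 1 > 65535:
        raise ValueError("port block exceeds the valid port range")
    return {
        "node_manager_port": start,
        "object_manager_port": start + 1,
        "dashboard_agent_grpc_port": start + 2,
        "min_worker_port": start + 3,
        "max_worker_port": start + width - 1,
    }


def ray_start_args(address: str, slot: int) -> list:
    """Build the worker command line for one (hypothetical) slot on a machine."""
    plan = port_plan(slot)
    return [
        "ray", "start", "--address", address, "--block",
        "--node-manager-port", str(plan["node_manager_port"]),
        "--object-manager-port", str(plan["object_manager_port"]),
        "--dashboard-agent-grpc-port", str(plan["dashboard_agent_grpc_port"]),
        "--min-worker-port", str(plan["min_worker_port"]),
        "--max-worker-port", str(plan["max_worker_port"]),
    ]


if __name__ == "__main__":
    # Two workers landing on the same machine get disjoint port blocks.
    print(" ".join(ray_start_args("HEAD_IP:6379", slot=0)))
    print(" ".join(ray_start_args("HEAD_IP:6379", slot=1)))
```

With these illustrative values, slot 0 uses ports 16000 to 16099 and slot 1 uses 16100 to 16199, so two workers on one machine never share a configured port. Whether this covers every port Ray opens is not something this sketch can guarantee.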
[1] Log from when a worker crashes:
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,497 E 24640 24701] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,515 E 24640 24640] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,530 E 24640 24701] (raylet) logging.cc:104: Stack trace:
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x5636a491d3ea] ray::operator<<()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x5636a491fbb8] ray::TerminateHandler()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2b0c8f3c9f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2b0c8f3c9f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2b0c8f3ca15a] __cxa_rethrow
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x5636a45d99c8] boost::throw_exception<>()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x5636a4e2dfb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x5636a465a74f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x5636a4793d2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x5636a478ce68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x5636a472a445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x5636a4e7c6e0] execute_native_thread_routine
(raylet, ip=10.31.133.83) /lib64/libpthread.so.0(+0x7ea5) [0x2b0c8f508ea5] start_thread
(raylet, ip=10.31.133.83) /lib64/libc.so.6(clone+0x6d) [0x2b0c8fd259fd] clone
(raylet, ip=10.31.133.83)
(raylet, ip=10.31.133.83) *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) PC: @ 0x2b0c8fc5d387 (unknown) raise
(raylet, ip=10.31.133.83) @ 0x2b0c8f510630 1872 (unknown)
(raylet, ip=10.31.133.83) @ 0x2b0c8f3c9f47 379532608 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) @ 0x2b0c8f3ca095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: PC: @ 0x2b0c8fc5d387 (unknown) raise
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f510630 1872 (unknown)
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f3c9f47 379532608 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f3ca095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.83) E1019 10:05:41.451662854 24724 server_chttp2.cc:48] {"created":"@1666188341.451599595","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188341.451589337","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188341.451577393","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451573528","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188341.451588645","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451586177","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(my_trainable pid=94802) 2022-10-19 10:05:44,511 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.algorithms.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.algorithms.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,542 WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,544 INFO simple_q.py:293 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(my_trainable pid=94802) 2022-10-19 10:05:44,545 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(my_trainable pid=94802) 2022-10-19 10:05:44,968 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,968 WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:45,182 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(my_trainable pid=94802) 2022-10-19 10:05:45,358 WARNING multi_agent_prioritized_replay_buffer.py:220 -- Adding batches with column `weights` to this buffer while providing weights as a call argument to the add method results in the column being overwritten.
(my_trainable pid=94802) 2022-10-19 10:05:45,577 WARNING deprecation.py:47 -- DeprecationWarning: `concat_samples` has been deprecated. Use `concat_samples() from rllib.policy.sample_batch` instead. This will raise an error in the future!
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,972 E 6499 6581] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,987 E 6499 6499] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,004 E 6499 6581] (raylet) logging.cc:104: Stack trace:
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x555abae573ea] ray::operator<<()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x555abae59bb8] ray::TerminateHandler()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2ada50389f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2ada50389f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2ada5038a15a] __cxa_rethrow
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x555abab139c8] boost::throw_exception<>()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x555abb367fb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x555abab9474f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x555abaccdd2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x555abacc6e68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x555abac64445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x555abb3b66e0] execute_native_thread_routine
(raylet, ip=10.31.133.85) /lib64/libpthread.so.0(+0x7ea5) [0x2ada504c8ea5] start_thread
(raylet, ip=10.31.133.85) /lib64/libc.so.6(clone+0x6d) [0x2ada50ce59fd] clone
(raylet, ip=10.31.133.85)
(raylet, ip=10.31.133.85) *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) PC: @ 0x2ada50c1d387 (unknown) raise
(raylet, ip=10.31.133.85) @ 0x2ada504d0630 1872 (unknown)
(raylet, ip=10.31.133.85) @ 0x2ada50389f47 362952000 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) @ 0x2ada5038a095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: PC: @ 0x2ada50c1d387 (unknown) raise
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada504d0630 1872 (unknown)
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada50389f47 362952000 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada5038a095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.85) E1019 10:05:49.881884938 6572 server_chttp2.cc:48] {"created":"@1666188349.881824741","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188349.881814569","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188349.881801585","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881797781","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188349.881813655","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881810971","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
[2] tune.run() crashing:
Traceback (most recent call last):
  File ".../main.py", line 499, in <module>
    main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
  File ".../main.py", line 475, in main
    tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 427, in run
    return ray.get(remote_future)
  File "..../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return getattr(ray, func.__name__)(*args, **kwargs)
  File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
    return self.worker.get(vals, timeout=timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 434, in get
    res = self._get(to_get, op_timeout)
  File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 462, in _get
    raise err
ray.exceptions.RayTaskError: ray::run() (pid=88223, ip=10.31.143.135)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 724, in run
    _report_progress(runner, progress_reporter)
  File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 125, in _report_progress
    reporter.report(trials, done, sched_debug_str, executor_debug_str)
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 641, in report
    print(self._progress_str(trials, done, *sys_info))
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 347, in _progress_str
    user_metrics = self._infer_user_metrics(trials, self._infer_limit)
  File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 396, in _infer_user_metrics
    if not t.last_result:
  File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 445, in last_result
    self._get_default_result_or_future()
  File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 420, in _get_default_result_or_future
    self._default_result_or_future = ray.get(self._default_result_or_future)
ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to create runtime environment {"env_vars": {"TUNE_ORIG_WORKING_DIR": "..."}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
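
The last error message asks for a port that is "not used by other processes". One way to check this before calling ray start could look like the minimal Python sketch below. The helper names `is_port_free` and `first_free_port` are hypothetical, the port range in the usage line is only an example, and the check is inherently racy: another process can still grab the port between the check and the moment Ray binds it.

```python
import socket


def is_port_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Bind failed: the port is in use, or binding it is not permitted.
            return False
    return True


def first_free_port(start: int, stop: int, host: str = "0.0.0.0") -> int:
    """Return the first port in [start, stop] that can be bound right now."""
    for port in range(start, stop + 1):
        if is_port_free(port, host):
            return port
    raise RuntimeError(f"no free port between {start} and {stop}")


if __name__ == "__main__":
    # Example range only; the result could be passed to
    # `ray start --dashboard-agent-grpc-port [port]`.
    print(first_free_port(16100, 16199))
```

Because of the race noted above, this reduces the chance of a conflict between workers on the same machine but does not rule it out.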