How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to get Ray to run on our Slurm cluster, but I am getting frequent crashes. The most likely cause, I think, is that I sometimes need to run multiple worker jobs (i.e. separate invocations of `ray start`) on the same physical machine. This is due to the way jobs are scheduled on the cluster, and I cannot change it. I have followed the discussion in "[core] [help] Running `ray start` on the same node in parallel would get port error" (Issue #10154, ray-project/ray on GitHub), and I am setting unique node manager, object manager, and min and max worker ports for each `ray start` command on the workers. Nevertheless, I get crashes.
There seem to be two related issues:
- Two workers on the same machine get a SIGABRT because a port is already in use, and one of the workers exits. Oddly, there are no error messages in the worker job, but I do get an error on the head node; see [1] below.
- Sometimes this even leads to my `tune.run()` crashing, which brings the entire job down; see [2] below.
Anything I’m doing wrong here?
My head and worker start commands look like this:
ray start --head --node-ip-address=$1 --port=6379 --redis-password=$2 --num-cpus=20 --block -v
and
ray start --address $1 --redis-password=$2 --num-cpus=5 --block --node-manager-port 16000 --object-manager-port 16001 --min-worker-port 16002 --max-worker-port 16099 -v
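For reference, the per-worker port assignment described above can be sketched roughly as follows. This is only an illustration, not the actual launch script: `worker_port_flags`, `worker_command`, `slot`, `base`, and `block` are hypothetical names, and how each job learns its `slot` (its index among the workers sharing one machine) is left open. It only covers the four port flags used in the worker command above.

```python
import shlex


def worker_port_flags(slot, base=16000, block=100):
    """Return `ray start` port flags for the `slot`-th worker on one machine.

    Each slot gets its own block of `block` consecutive ports, so the
    ranges of two different slots never overlap. `slot`, `base`, and
    `block` are illustrative assumptions, not Ray parameters.
    """
    if slot < 0:
        raise ValueError("slot must be non-negative")
    start = base + slot * block
    end = start + block - 1
    if end > 65535:
        raise ValueError("port block exceeds the valid port range")
    return [
        "--node-manager-port", str(start),
        "--object-manager-port", str(start + 1),
        "--min-worker-port", str(start + 2),
        "--max-worker-port", str(end),
    ]


def worker_command(address, redis_password, slot, num_cpus=5):
    """Build the worker command line with slot-specific ports."""
    cmd = [
        "ray", "start",
        "--address", address,
        f"--redis-password={redis_password}",
        f"--num-cpus={num_cpus}",
        "--block",
    ]
    cmd += worker_port_flags(slot)
    cmd.append("-v")
    return shlex.join(cmd)


if __name__ == "__main__":
    # Placeholder address and password, for illustration only.
    print(worker_command("10.0.0.1:6379", "secret", slot=1))
```

With the defaults, slot 0 reproduces the ports in the command above (16000, 16001, 16002 to 16099), and slot 1 moves to 16100 and up.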
[1] Log from when a worker crashes:
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,497 E 24640 24701] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,515 E 24640 24640] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,530 E 24640 24701] (raylet) logging.cc:104: Stack trace:
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x5636a491d3ea] ray::operator<<()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x5636a491fbb8] ray::TerminateHandler()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2b0c8f3c9f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2b0c8f3c9f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.83) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2b0c8f3ca15a] __cxa_rethrow
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x5636a45d99c8] boost::throw_exception<>()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x5636a4e2dfb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x5636a465a74f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x5636a4793d2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x5636a478ce68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x5636a472a445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.83) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x5636a4e7c6e0] execute_native_thread_routine
(raylet, ip=10.31.133.83) /lib64/libpthread.so.0(+0x7ea5) [0x2b0c8f508ea5] start_thread
(raylet, ip=10.31.133.83) /lib64/libc.so.6(clone+0x6d) [0x2b0c8fd259fd] clone
(raylet, ip=10.31.133.83)
(raylet, ip=10.31.133.83) *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) PC: @ 0x2b0c8fc5d387 (unknown) raise
(raylet, ip=10.31.133.83) @ 0x2b0c8f510630 1872 (unknown)
(raylet, ip=10.31.133.83) @ 0x2b0c8f3c9f47 379532608 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) @ 0x2b0c8f3ca095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: *** SIGABRT received at time=1666188338 on cpu 15 ***
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: PC: @ 0x2b0c8fc5d387 (unknown) raise
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f510630 1872 (unknown)
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f3c9f47 379532608 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.83) [2022-10-19 10:05:38,531 E 24640 24701] (raylet) logging.cc:361: @ 0x2b0c8f3ca095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.83) E1019 10:05:41.451662854 24724 server_chttp2.cc:48] {"created":"@1666188341.451599595","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188341.451589337","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188341.451577393","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451573528","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188341.451588645","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188341.451586177","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(my_trainable pid=94802) 2022-10-19 10:05:44,511 WARNING deprecation.py:47 -- DeprecationWarning: `ray.rllib.algorithms.dqn.dqn.DEFAULT_CONFIG` has been deprecated. Use `ray.rllib.algorithms.dqn.dqn.DQNConfig(...)` instead. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,542 WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,544 INFO simple_q.py:293 -- In multi-agent mode, policies will be optimized sequentially by the multi-GPU optimizer. Consider setting `simple_optimizer=True` if this doesn't work for you.
(my_trainable pid=94802) 2022-10-19 10:05:44,545 INFO algorithm.py:351 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
(my_trainable pid=94802) 2022-10-19 10:05:44,968 WARNING deprecation.py:47 -- DeprecationWarning: `simple_optimizer` has been deprecated. This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:44,968 WARNING deprecation.py:47 -- DeprecationWarning: `config['multiagent']['replay_mode']` has been deprecated. config['replay_buffer_config']['replay_mode'] This will raise an error in the future!
(my_trainable pid=94802) 2022-10-19 10:05:45,182 WARNING util.py:65 -- Install gputil for GPU system monitoring.
(my_trainable pid=94802) 2022-10-19 10:05:45,358 WARNING multi_agent_prioritized_replay_buffer.py:220 -- Adding batches with column `weights` to this buffer while providing weights as a call argument to the add method results in the column being overwritten.
(my_trainable pid=94802) 2022-10-19 10:05:45,577 WARNING deprecation.py:47 -- DeprecationWarning: `concat_samples` has been deprecated. Use `concat_samples() from rllib.policy.sample_batch` instead. This will raise an error in the future!
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,972 E 6499 6581] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:46,987 E 6499 6499] (raylet) logging.cc:97: Unhandled exception: N5boost10wrapexceptINS_6system12system_errorEEE. what(): bind: Address already in use
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,004 E 6499 6581] (raylet) logging.cc:104: Stack trace:
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47c3ea) [0x555abae573ea] ray::operator<<()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x47ebb8) [0x555abae59bb8] ray::TerminateHandler()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf47) [0x2ada50389f47] __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(+0xabf7d) [0x2ada50389f7d] __cxxabiv1::__unexpected()
(raylet, ip=10.31.133.85) /n/sw/eb/apps/centos7/Anaconda3/2020.11/lib64/libstdc++.so.6(__cxa_rethrow+0) [0x2ada5038a15a] __cxa_rethrow
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1389c8) [0x555abab139c8] boost::throw_exception<>()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x98cfb9) [0x555abb367fb9] boost::asio::detail::do_throw_error()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x1b974f) [0x555abab9474f] _ZN5boost4asio21basic_socket_acceptorINS0_7generic15stream_protocolENS0_9execution12any_executorIJNS4_12context_as_tIRNS0_17execution_contextEEENS4_6detail8blocking7never_tILi0EEENS4_11prefer_onlyINSB_10possibly_tILi0EEEEENSE_INSA_16outstanding_work9tracked_tILi0EEEEENSE_INSI_11untracked_tILi0EEEEENSE_INSA_12relationship6fork_tILi0EEEEENSE_INSP_14continuation_tILi0EEEEEEEEEC1I23instrumented_io_contextEERT_RKNS2_14basic_endpointIS3_EEbPNSt9enable_ifIXsrSt14is_convertibleIS11_S8_E5valueEvE4typeE
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2f2d2d) [0x555abaccdd2d] plasma::PlasmaStore::PlasmaStore()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x2ebe68) [0x555abacc6e68] plasma::PlasmaStoreRunner::Start()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x289445) [0x555abac64445] std::thread::_State_impl<>::_M_run()
(raylet, ip=10.31.133.85) .../lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet(+0x9db6e0) [0x555abb3b66e0] execute_native_thread_routine
(raylet, ip=10.31.133.85) /lib64/libpthread.so.0(+0x7ea5) [0x2ada504c8ea5] start_thread
(raylet, ip=10.31.133.85) /lib64/libc.so.6(clone+0x6d) [0x2ada50ce59fd] clone
(raylet, ip=10.31.133.85)
(raylet, ip=10.31.133.85) *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) PC: @ 0x2ada50c1d387 (unknown) raise
(raylet, ip=10.31.133.85) @ 0x2ada504d0630 1872 (unknown)
(raylet, ip=10.31.133.85) @ 0x2ada50389f47 362952000 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) @ 0x2ada5038a095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: *** SIGABRT received at time=1666188347 on cpu 30 ***
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,005 E 6499 6581] (raylet) logging.cc:361: PC: @ 0x2ada50c1d387 (unknown) raise
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada504d0630 1872 (unknown)
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada50389f47 362952000 __cxxabiv1::__terminate()
(raylet, ip=10.31.133.85) [2022-10-19 10:05:47,006 E 6499 6581] (raylet) logging.cc:361: @ 0x2ada5038a095 (unknown) __cxa_tm_cleanup
(raylet, ip=10.31.133.85) E1019 10:05:49.881884938 6572 server_chttp2.cc:48] {"created":"@1666188349.881824741","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1666188349.881814569","description":"Failed to add any wildcard listeners","file":"src/core/lib/iomgr/tcp_server_posix.cc","file_line":348,"referenced_errors":[{"created":"@1666188349.881801585","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881797781","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1666188349.881813655","description":"Unable to configure socket","fd":14,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1666188349.881810971","description":"Address already in use","errno":98,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
[2] `tune.run()` crashing:
Traceback (most recent call last):
File ".../main.py", line 499, in <module>
main(args, args.num_cpus, group=args.experiment_group, name=args.experiment_name, ray_local_mode=args.ray_local_mode)
File ".../main.py", line 475, in main
tune.run(experiments, callbacks=callbacks, raise_on_failed_trial=False)
File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 427, in run
return ray.get(remote_future)
File "..../lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return getattr(ray, func.__name__)(*args, **kwargs)
File ".../lib/python3.9/site-packages/ray/util/client/api.py", line 42, in get
return self.worker.get(vals, timeout=timeout)
File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 434, in get
res = self._get(to_get, op_timeout)
File ".../lib/python3.9/site-packages/ray/util/client/worker.py", line 462, in _get
raise err
ray.exceptions.RayTaskError: ray::run() (pid=88223, ip=10.31.143.135)
File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 724, in run
_report_progress(runner, progress_reporter)
File ".../lib/python3.9/site-packages/ray/tune/tune.py", line 125, in _report_progress
reporter.report(trials, done, sched_debug_str, executor_debug_str)
File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 641, in report
print(self._progress_str(trials, done, *sys_info))
File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 347, in _progress_str
user_metrics = self._infer_user_metrics(trials, self._infer_limit)
File ".../lib/python3.9/site-packages/ray/tune/progress_reporter.py", line 396, in _infer_user_metrics
if not t.last_result:
File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 445, in last_result
self._get_default_result_or_future()
File ".../lib/python3.9/site-packages/ray/tune/experiment/trial.py", line 420, in _get_default_result_or_future
self._default_result_or_future = ray.get(self._default_result_or_future)
ray.exceptions.RuntimeEnvSetupError: Failed to setup runtime environment.
Could not create the actor because its associated runtime env failed to be created.
Failed to create runtime environment {"env_vars": {"TUNE_ORIG_WORKING_DIR": "..."}} because the Ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
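The error message above suggests passing a hard-coded `--dashboard-agent-grpc-port` and making sure the port is not used by other processes. One way that check might look is sketched below. `port_is_free` and `first_free_port` are hypothetical helper names, not part of Ray, and the check is inherently racy: another process can still take the port between the check and the moment `ray start` binds it.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can currently be bound to (host, port).

    An empty host means all interfaces. This is a point-in-time check
    only; it does not reserve the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start, stop, host=""):
    """Return the first port in [start, stop) that can be bound."""
    for port in range(start, stop):
        if port_is_free(port, host=host):
            return port
    raise RuntimeError(f"no free port in range {start}-{stop - 1}")
```

The returned value could then be passed to `ray start --dashboard-agent-grpc-port`, ideally chosen from a range that does not overlap the node manager, object manager, or worker port ranges of any job on the same machine.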