Is running Ray on Mesos supported? I have a small test script that runs fine on my laptop but appears to hang when run under Mesos on a server.
Script: A starter example that trains, checkpoints and evaluates a RL algorithm in RLlib · GitHub
Python 3.6
Ray 1.5.2
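The `Resource temporarily unavailable` and `can't start new thread` errors in the logs below may point to a process or thread limit being hit inside the Mesos container. As a rough diagnostic sketch (the helper name `describe_limits` is made up for illustration, and this only reads the limits; it does not confirm the cause), the same Python environment can report how many CPUs it sees and what its per-user process limit is:

```python
import os
import resource


def describe_limits():
    """Return the CPU count Python sees and the per-user process/thread limit.

    Hypothetical helper for illustration; not part of Ray.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "cpu_count": os.cpu_count(),  # host CPUs visible to Python, may exceed the container's share
        "nproc_soft": soft,           # resource.RLIM_INFINITY means unlimited
        "nproc_hard": hard,
    }


if __name__ == "__main__":
    for name, value in describe_limits().items():
        print(f"{name}: {value}")
```

If the reported CPU count is much larger than what the Mesos task was actually allocated, passing a smaller `num_cpus` to `ray.init` might be worth trying. That is an assumption about the cause, not a confirmed fix.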
In /tmp/ray/session-latest/logs:
$ more python-core-worker-ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074_13951.log
[2021-08-20 18:47:19,700 I 13951 13951] core_worker.cc:152: Constructing CoreWorkerProcess. pid: 13951
[2021-08-20 18:47:19,716 I 13951 13951] core_worker.cc:374: Constructing CoreWorker, worker_id: ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074
[2021-08-20 18:47:19,716 I 13951 13951] grpc_server.cc:71: worker server started, listening on port 34417.
[2021-08-20 18:47:19,721 I 13951 13951] core_worker.cc:438: Initializing worker at address: 100.99.156.11:34417, worker ID ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074, raylet 3145084fdfebfb28eaee8050593f51f0f7511c003948be2523e5be6a
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: *** SIGABRT received at time=1629452839 on cpu 18 ***
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: PC: @ 0x7fa8e8137337 (unknown) raise
[2021-08-20 18:47:19,738 E 13951 13951] logging.cc:299: @ 0x7fa8e8def5f0 (unknown) (unknown)
[2021-08-20 18:47:19,739 E 13951 13951] logging.cc:299: @ 0x61726f706d657420 (unknown) (unknown)
$ more raylet.err
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 34 ***
PC: @ 0x7f7c4e3dd337 (unknown) raise
@ 0x7f7c4f0955f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 56 ***
PC: @ 0x7fc352360337 (unknown) raise
@ 0x7fc3530185f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @ 0x7fa873277337 (unknown) raise
@ 0x7fa873f2f5f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 64 ***
PC: @ 0x7f4832995337 (unknown) raise
@ 0x7f483364d5f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @ 0x7f60c964b337 (unknown) raise
@ 0x7f60ca3035f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 86 ***
PC: @ 0x7eff124d9337 (unknown) raise
@ 0x7eff131915f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 80 ***
PC: @ 0x7f70be8b4337 (unknown) raise
@ 0x7f70bf56c5f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 6 ***
/bin/sh: fork: retry: No child processes
PC: @ 0x7f73e03e0337 (unknown) raise
@ 0x7f73e10985f0 (unknown) (unknown)
@ 0x61726f706d657420 (unknown) (unknown)
Traceback (most recent call last):
File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 187, in <module>
worker_shim_pid=args.worker_shim_pid)
File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/worker.py", line 1361, in connect
worker.import_thread.start()
File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/import_thread.py", line 46, in start
self.t.start()
File "/opt/conda/lib/python3.6/threading.py", line 846, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper