Errors running inside Apache Mesos

Is running in Mesos supported? I have a small test script that runs fine on my laptop but seems stuck in Mesos (on a server).

Script: A starter example that trains, checkpoints and evaluates a RL algorithm in RLlib · GitHub
Python 3.6
Ray 1.5.2

In /tmp/ray/session_latest/logs:

$ more python-core-worker-ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074_13951.log
[2021-08-20 18:47:19,700 I 13951 13951] core_worker.cc:152: Constructing CoreWorkerProcess. pid: 13951
[2021-08-20 18:47:19,716 I 13951 13951] core_worker.cc:374: Constructing CoreWorker, worker_id: ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074
[2021-08-20 18:47:19,716 I 13951 13951] grpc_server.cc:71: worker server started, listening on port 34417.
[2021-08-20 18:47:19,721 I 13951 13951] core_worker.cc:438: Initializing worker at address: 100.99.156.11:34417, worker ID ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074, raylet 3145084fdfebfb28eaee8050593f51f0f7511c003948be2523e5be6a
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: *** SIGABRT received at time=1629452839 on cpu 18 ***
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: PC: @     0x7fa8e8137337  (unknown)  raise
[2021-08-20 18:47:19,738 E 13951 13951] logging.cc:299:     @     0x7fa8e8def5f0  (unknown)  (unknown)
[2021-08-20 18:47:19,739 E 13951 13951] logging.cc:299:     @ 0x61726f706d657420  (unknown)  (unknown)
$ more raylet.err
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 34 ***
PC: @     0x7f7c4e3dd337  (unknown)  raise
    @     0x7f7c4f0955f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 56 ***
PC: @     0x7fc352360337  (unknown)  raise
    @     0x7fc3530185f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @     0x7fa873277337  (unknown)  raise
    @     0x7fa873f2f5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 64 ***
PC: @     0x7f4832995337  (unknown)  raise
    @     0x7f483364d5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @     0x7f60c964b337  (unknown)  raise
    @     0x7f60ca3035f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 86 ***
PC: @     0x7eff124d9337  (unknown)  raise
    @     0x7eff131915f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 80 ***
PC: @     0x7f70be8b4337  (unknown)  raise
    @     0x7f70bf56c5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 6 ***
/bin/sh: fork: retry: No child processes
PC: @     0x7f73e03e0337  (unknown)  raise
    @     0x7f73e10985f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
Traceback (most recent call last):
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 187, in <module>
    worker_shim_pid=args.worker_shim_pid)
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/worker.py", line 1361, in connect
    worker.import_thread.start()
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/import_thread.py", line 46, in start
    self.t.start()
  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper

We were running it on DC/OS, which is built on Mesos, and it worked.

I think the main question here is: how do you run the cluster?

Thanks for sharing.

I don’t know how the cluster is run. I’m an end user, not the admin. I did install Ray in the instance I was given. I probably need to work with my Mesos admin. Is there something else I should give them about Ray?

What I mean:
Is it one node with several CPUs, or different nodes (a cluster) with several CPUs each?

We were running everything inside Docker containers, so that might be a difference. But I am also only a user, not an expert there…

Good question. I have no idea… above my pay grade :stuck_out_tongue:

It turns out something was wrong with my Mesos instance. After recreating it from scratch, the test script ran fine.
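
For anyone hitting the same errors: "Resource temporarily unavailable", "fork: retry: No child processes" and "can't start new thread" appearing together usually mean the process or thread limit of the instance was reached, and Ray by default starts one worker per CPU it detects. Below is a minimal sketch, using only the Python standard library, of what could be collected and handed to a Mesos admin. The `collect_limits` and `describe_limit` helpers are made up for illustration, and the cgroup paths are assumptions that may not exist on every setup.

```python
import os
import resource
import threading

# Assumed locations of the cgroup pids limit (v1 and v2); may be absent.
CGROUP_PIDS_PATHS = (
    "/sys/fs/cgroup/pids/pids.max",
    "/sys/fs/cgroup/pids.max",
)


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def read_cgroup_pids_max():
    """Return the first readable cgroup pids limit, or 'not found'."""
    for path in CGROUP_PIDS_PATHS:
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            continue
    return "not found"


def collect_limits():
    """Gather the limits most relevant to thread and fork failures."""
    nproc_soft, nproc_hard = resource.getrlimit(resource.RLIMIT_NPROC)
    nofile_soft, nofile_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {
        "cpu_count": os.cpu_count(),
        "nproc_soft": describe_limit(nproc_soft),
        "nproc_hard": describe_limit(nproc_hard),
        "nofile_soft": describe_limit(nofile_soft),
        "nofile_hard": describe_limit(nofile_hard),
        "cgroup_pids_max": read_cgroup_pids_max(),
        "threads_in_this_process": threading.active_count(),
    }


if __name__ == "__main__":
    for name, value in collect_limits().items():
        print(f"{name}: {value}")
```

If `cpu_count` is large while `nproc_soft` or `cgroup_pids_max` is small, passing a smaller `num_cpus` to `ray.init()` may be worth trying before recreating the instance.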