Errors running inside Apache Mesos

RickLan · August 20, 2021, 12:01pm

Is running in Mesos supported? I have a small test script that runs fine on my laptop but seems stuck in Mesos (on a server).

Script: A starter example that trains, checkpoints and evaluates a RL algorithm in RLlib · GitHub
Python 3.6
Ray 1.5.2

In /tmp/ray/session-latest/logs:

$ more python-core-worker-ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074_13951.log
[2021-08-20 18:47:19,700 I 13951 13951] core_worker.cc:152: Constructing CoreWorkerProcess. pid: 13951
[2021-08-20 18:47:19,716 I 13951 13951] core_worker.cc:374: Constructing CoreWorker, worker_id: ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074
[2021-08-20 18:47:19,716 I 13951 13951] grpc_server.cc:71: worker server started, listening on port 34417.
[2021-08-20 18:47:19,721 I 13951 13951] core_worker.cc:438: Initializing worker at address: 100.99.156.11:34417, worker ID ce2ebe3ec1155a2b1b632cfdf3d7e24d6509bd005730bbaa089ce074, raylet 3145084fdfebfb28eaee8050593f51f0f7511c003948be2523e5be6a
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: *** SIGABRT received at time=1629452839 on cpu 18 ***
[2021-08-20 18:47:19,737 E 13951 13951] logging.cc:299: PC: @     0x7fa8e8137337  (unknown)  raise
[2021-08-20 18:47:19,738 E 13951 13951] logging.cc:299:     @     0x7fa8e8def5f0  (unknown)  (unknown)
[2021-08-20 18:47:19,739 E 13951 13951] logging.cc:299:     @ 0x61726f706d657420  (unknown)  (unknown)

$ more raylet.err
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 34 ***
PC: @     0x7f7c4e3dd337  (unknown)  raise
    @     0x7f7c4f0955f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 56 ***
PC: @     0x7fc352360337  (unknown)  raise
    @     0x7fc3530185f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @     0x7fa873277337  (unknown)  raise
    @     0x7fa873f2f5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 64 ***
PC: @     0x7f4832995337  (unknown)  raise
    @     0x7f483364d5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 54 ***
PC: @     0x7f60c964b337  (unknown)  raise
    @     0x7f60ca3035f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 86 ***
PC: @     0x7eff124d9337  (unknown)  raise
    @     0x7eff131915f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 80 ***
PC: @     0x7f70be8b4337  (unknown)  raise
    @     0x7f70bf56c5f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
/bin/sh: fork: retry: No child processes
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
*** SIGABRT received at time=1629452839 on cpu 6 ***
/bin/sh: fork: retry: No child processes
PC: @     0x7f73e03e0337  (unknown)  raise
    @     0x7f73e10985f0  (unknown)  (unknown)
    @ 0x61726f706d657420  (unknown)  (unknown)
Traceback (most recent call last):
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/workers/default_worker.py", line 187, in <module>
    worker_shim_pid=args.worker_shim_pid)
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/worker.py", line 1361, in connect
    worker.import_thread.start()
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/import_thread.py", line 46, in start
    self.t.start()
  File "/opt/conda/lib/python3.6/threading.py", line 846, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/rick.lan/.local/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 82, in wrapper

TanjaBayer · August 20, 2021, 2:13pm

We were running it on DC/OS, which is built on Mesos, and it worked.

I think the main question here is: how do you run the cluster?

RickLan · August 20, 2021, 11:04pm

Thanks for sharing.

I don’t know how the cluster is run. I’m an end user, not the admin. I did install Ray in the instance I was given. I probably need to work with my Mesos admin. Is there something else I should give them about Ray?

TanjaBayer · August 21, 2021, 6:48am

What I mean is: is it one node with several CPUs, or several nodes (a cluster), each with several CPUs?

We were running everything inside Docker containers, so that might be a difference. But I am also only a user, not an expert there…

RickLan · August 21, 2021, 7:25am

Good question. I have no idea… above my pay grade

RickLan · August 23, 2021, 7:39am

It turns out something was wrong with my Mesos instance. After recreating it from scratch, the test script ran fine.
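
For anyone who lands here with the same logs: "Resource temporarily unavailable", "fork: retry: No child processes" and "can't start new thread" together typically indicate that a process or thread limit was reached on the host or in the container. The thread does not confirm that this was the cause in this case. The snippet below is only a minimal sketch, assuming a Linux instance, of the kind of information that could be handed to a Mesos admin. The helper names `describe_limit` and `collect_limits` are made up for illustration and are not part of Ray.

```python
import os
import resource
import threading


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Gather the per-user process/thread limit and basic CPU info (Linux)."""
    # RLIMIT_NPROC caps how many processes and threads the current user may own.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return {
        "nproc_soft": describe_limit(soft),
        "nproc_hard": describe_limit(hard),
        # CPUs the machine reports vs. CPUs this process is allowed to run on.
        "cpu_count": os.cpu_count(),
        "usable_cpus": len(os.sched_getaffinity(0)),
        "threads_in_this_process": threading.active_count(),
    }


if __name__ == "__main__":
    for name, value in collect_limits().items():
        print(f"{name}: {value}")
```

Run it in the same instance and as the same user that starts Ray. It only reads the per-user rlimit; a container runtime may enforce a separate cgroup process limit that this sketch does not inspect, so a large value here does not rule out a limit imposed from outside.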
