Cannot start Ray Cluster under gramine-sgx enclave

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

We are trying to run Ray (ray start) inside Gramine-SGX enclaves. (Please search for gramine-sgx; I’m restricted from putting more than two links in this post.)

Gramine is a library OS for Linux multi-process applications, with Intel SGX support.

I am running into basic Ray bootstrap issues, which I have also raised in Gramine discussion 1680.

Problem Description: ‘ray start’ seems to succeed for a brief while, before Ray’s process watcher reports:

process_watcher.py:89 -- Raylet is considered dead 1 X

There are a few different issues I’d like some troubleshooting help with.

  1. Messages seen while booting up ‘ray start’ under gramine-direct:
2023-12-13 03:08:54,186	SUCC scripts.py:781 -- --------------------
2023-12-13 03:08:54,186	SUCC scripts.py:782 -- Ray runtime started.
2023-12-13 03:08:54,186	SUCC scripts.py:783 -- --------------------
2023-12-13 03:08:54,186	INFO scripts.py:785 -- Next steps
2023-12-13 03:08:54,186	INFO scripts.py:788 -- To add another node to this Ray cluster, run
2023-12-13 03:08:54,187	INFO scripts.py:791 --   ray start --address='10.208.196.155:6379'
2023-12-13 03:08:54,187	INFO scripts.py:800 -- To connect to this Ray cluster:
2023-12-13 03:08:54,187	INFO scripts.py:802 -- import ray
2023-12-13 03:08:54,187	INFO scripts.py:803 -- ray.init()
2023-12-13 03:08:54,187	INFO scripts.py:834 -- To terminate the Ray runtime, run
2023-12-13 03:08:54,187	INFO scripts.py:835 --   ray stop
2023-12-13 03:08:54,187	INFO scripts.py:838 -- To view the status of the cluster, use
2023-12-13 03:08:54,187	INFO scripts.py:839 --   ray status

It appears that ray start did succeed [1], but then I see messages like those shown in [2]:

  2. Snippets of messages from Ray’s dashboard_agent.log:
2023-12-11 22:22:20,007	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-12-11_22-22-10_412945_1/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x2d4742bbf670>>
2023-12-11 22:22:20,007	INFO http_server_agent.py:79 -- Registered 30 routes.
2023-12-11 22:22:20,012	INFO process_watcher.py:44 -- raylet pid is 15
2023-12-11 22:22:20,012	WARNING process_watcher.py:89 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,016	INFO event_agent.py:56 -- Report events to 10.208.196.155:45899
2023-12-11 22:22:20,017	INFO event_utils.py:132 -- Monitor events logs modified after 1702331539.842946 on /tmp/ray/session_2023-12-11_22-22-10_412945_1/logs/events, the source types are all.
2023-12-11 22:22:20,019	ERROR reporter_agent.py:1149 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1132, in _perform_iteration
    stats = self._get_all_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 630, in _get_all_stats
    network_stats = self._get_network_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 434, in _get_network_stats
    v for k, v in psutil.net_io_counters(pernic=True).items() if k[0] == "e"
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/__init__.py", line 2122, in net_io_counters
    rawdict = _psplatform.net_io_counters()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1023, in net_io_counters
    with open_text("%s/net/dev" % get_procfs_path()) as f:
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_common.py", line 786, in open_text
    fobj = open(fname, buffering=FILE_READ_BUFFER_SIZE,
FileNotFoundError: [Errno 2] No such file or directory: '/proc/net/dev'
2023-12-11 22:22:20,414	WARNING process_watcher.py:89 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:20,816	WARNING process_watcher.py:89 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,217	WARNING process_watcher.py:89 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618	WARNING process_watcher.py:89 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-11 22:22:21,618	INFO agent.py:227 -- Terminated Raylet: ip=10.208.196.155, node_id=5ea8e24111ca76d5135365a6bea0e7da046378d411339af896a80d4f.
2023-12-11 22:22:21,619	ERROR process_watcher.py:142 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] Event stats:
    [state-dump] 	PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), CPU time: mean = 599.819 us, total = 6.598 ms
    [state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
    [state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
  3. Digging into Ray threads, I found this Ray issue and a PR that tries to address this node check problem:

Ray Issue-29412: [Ray Core] Ray agent getting killed unexpectedly

Which led to a tentative code fix in Ray’s Python libraries:
Ray PR-29540: [Agent] Make agent shutdown more informative and graceful

The point of these two threads is that there may have been an issue with the Python library call psutil.Process.parent() misreporting that the parent is down, causing cascading shutdowns on the Ray side. That would be consistent with the "Parent: None, parent_gone: True" fields in the warnings I see.

Does anyone know if these are real issues or red herrings?
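
One way to narrow this down might be to ask psutil directly from inside the enclave. Below is a minimal sketch, assuming a standalone psutil is importable there (Ray uses its own vendored copy under ray/thirdparty_files, as the tracebacks show). The helper name describe_parent is hypothetical and is not part of Ray; it only compares what the OS and psutil each report about the current process's parent.

```python
import os

try:
    # Assumption: a separately installed psutil; Ray vendors its own copy.
    import psutil
except ImportError:
    psutil = None


def describe_parent():
    """Collect what the OS and (if available) psutil report about our parent."""
    report = {"pid": os.getpid(), "os_getppid": os.getppid()}
    if psutil is None:
        report["psutil"] = "not installed"
        return report
    try:
        proc = psutil.Process()   # no argument means the current process
        parent = proc.parent()    # None when psutil cannot find a parent
        report["psutil_ppid"] = proc.ppid()
        report["parent_found"] = parent is not None
    except Exception as exc:      # gaps in /proc may surface as various errors
        report["psutil_error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in describe_parent().items():
        print(f"{key}: {value}")
```

If parent_found comes back False under gramine-direct or gramine-sgx but True when the same script runs natively, that would suggest psutil's view of /proc inside Gramine is the trigger, rather than a raylet that has actually died. This is a hypothesis to test, not a diagnosis.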

  4. Under Gramine-SGX, some OS devices are not supported.

So, in other attempts, I run into errors like this:

2023-12-13 03:08:54,748	INFO http_server_agent.py:78 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2023-12-13_03-08-42_327199_1/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x35dbff9ce190>>
2023-12-13 03:08:54,748	INFO http_server_agent.py:79 -- Registered 30 routes.
2023-12-13 03:08:54,751	INFO process_watcher.py:44 -- raylet pid is 23
2023-12-13 03:08:54,752	WARNING process_watcher.py:89 -- Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-13 03:08:54,755	INFO event_agent.py:56 -- Report events to 10.208.196.155:44593
2023-12-13 03:08:54,755	INFO event_utils.py:132 -- Monitor events logs modified after 1702435134.608772 on /tmp/ray/session_2023-12-13_03-08-42_327199_1/logs/events, the source types are all.
2023-12-13 03:08:54,757	ERROR reporter_agent.py:1149 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 1132, in _perform_iteration
    stats = self._get_all_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 634, in _get_all_stats
    disk_stats = self._get_disk_io_stats()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 468, in _get_disk_io_stats
    stats = psutil.disk_io_counters()
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/__init__.py", line 2072, in disk_io_counters
    rawdict = _psplatform.disk_io_counters(**kwargs)
  File "/home/sgx/.local/lib/python3.8/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1161, in disk_io_counters
    raise NotImplementedError(
NotImplementedError: /proc/diskstats nor /sys/block filesystem are available on this system
2023-12-13 03:08:55,153	WARNING process_watcher.py:89 -- Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-13 03:08:55,554	WARNING process_watcher.py:89 -- Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-13 03:08:55,956	WARNING process_watcher.py:89 -- Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-13 03:08:56,357	WARNING process_watcher.py:89 -- Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-12-13 03:08:56,357	INFO agent.py:227 -- Terminated Raylet: ip=10.208.196.155, node_id=e2578e7e7d7a5b50f6e32b09705f3aefe88cd11a807ebc8253730a9a.
2023-12-13 03:08:56,358	ERROR process_watcher.py:142 -- Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    [state-dump] Event stats:
    [state-dump] 	PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), CPU time: mean = 426.345 us, total = 4.690 ms
    [state-dump] 	ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.101 us, total = 7.101 us

I am starting Ray with the --include-dashboard=false and --disable-usage-stats options, but it still seems that Ray’s dashboard utility methods (dashboard/modules/reporter/reporter_agent.py) are being run.

What is the command-line parameter to completely turn off these dashboard / monitoring / metrics-collection modules?

If I can at least bypass this checking, I can avoid hitting Gramine’s unsupported-devices limitation, just to see how much further I can move this node bootstrapping.
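
To see in advance which of these calls will fail, a small probe could be run under gramine-direct or gramine-sgx before starting Ray. This is a minimal sketch, assuming a standalone psutil is importable inside the enclave; the paths and the two psutil calls are taken from the tracebacks above, and the helper names probe_paths and probe_psutil are hypothetical.

```python
import os

try:
    # Assumption: a separately installed psutil; Ray vendors its own copy.
    import psutil
except ImportError:
    psutil = None

# Paths named in the two tracebacks from reporter_agent.py.
PROC_PATHS = ["/proc/net/dev", "/proc/diskstats", "/sys/block"]


def probe_paths(paths=PROC_PATHS):
    """Report which pseudo-filesystem paths are visible to this process."""
    return {path: os.path.exists(path) for path in paths}


def probe_psutil():
    """Call the psutil functions that failed and record the outcome of each."""
    if psutil is None:
        return {}
    calls = {
        "net_io_counters": lambda: psutil.net_io_counters(pernic=True),
        "disk_io_counters": lambda: psutil.disk_io_counters(),
    }
    results = {}
    for name, call in calls.items():
        try:
            call()
            results[name] = "ok"
        except Exception as exc:  # e.g. FileNotFoundError, NotImplementedError
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    print(probe_paths())
    print(probe_psutil())
```

Running the same script natively and under Gramine, then comparing the two outputs, should show which of the missing files are specific to the enclave environment.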

Thanks in advance,
–AdityA