How severe does this issue affect your experience of using Ray?
- High: It blocks me to complete my task.
Hello,
We’ve been using Ray actors for a while without many problems. Recently, however, the error handling inside our Ray actor (which hadn’t been triggered for a while) started to exhibit strange and inconsistent behavior.
I have reproduced the error in bpython with the small test case below.
Basically, we spin off an asyncio task within the actor, which may fail and kill the actor depending on the result.
Below, we always fail, but with varying delays. Sometimes we fail before the main process awaits wait_for_task, sometimes after. In both cases, it randomly fails with SYSTEM_ERROR, although we expect a graceful INTENDED_USER_EXIT. This causes bigger problems in our actual application code, which you can’t see in this simple single-actor example.
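For comparison, here is a minimal plain-asyncio sketch of the same timing pattern, without Ray. TaskFailed, observe and compare are hypothetical names used only for illustration, and the exception is just a stand-in for the actor exit. In plain asyncio the awaiting side sees the same failure whether the background task fails before or after it is awaited, which is the consistency we expected from the actor as well.

```python
import asyncio


class TaskFailed(Exception):
    """Hypothetical stand-in for the actor's failure; not a Ray exception."""


async def run_task(fail_after: float) -> None:
    # Background task that always fails after the given delay.
    await asyncio.sleep(fail_after)
    raise TaskFailed(f"failed after {fail_after}s")


async def observe(fail_after: float, wait_before_await: float) -> str:
    # Start the task, wait a bit, then await it and report what we saw.
    task = asyncio.create_task(run_task(fail_after))
    await asyncio.sleep(wait_before_await)
    try:
        await task
    except TaskFailed:
        return "TaskFailed"
    return "no error"


def compare() -> dict:
    return {
        # Task has already failed by the time it is awaited.
        "fails_before_await": asyncio.run(observe(0.0, 0.05)),
        # Task fails while the caller is already awaiting it.
        "fails_after_await": asyncio.run(observe(0.1, 0.05)),
    }


if __name__ == "__main__":
    print(compare())
```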
Setup Code
bpython version 0.24 on top of Python 3.10.9 /home/[…]/.cache/pypoetry/virtualenvs/[…]-ecZlKSFR-py3.10/bin/python
>>> import ray
>>> import asyncio
>>>
>>> @ray.remote
... class Actor:
...     def __init__(self, fail_after: int):
...         self.task = asyncio.create_task(self.run_task(fail_after))
...
...     async def run_task(self, fail_after: int):
...         await asyncio.sleep(fail_after)
...         ray.actor.exit_actor()
...
...     async def wait_for_task(self):
...         await self.task
...
...
...
>>> async def test_flow(fail_after: int):
...     actor = Actor.remote(fail_after)
...     await asyncio.sleep(2)
...     await actor.wait_for_task.remote()
...     print("Done!")
...
...
>>> ray.init()
2023-02-23 12:10:53,936 INFO worker.py:1538 -- Started a local Ray instance.
RayContext(dashboard_url='', python_version='3.10.9', ray_version='2.2.0', ray_commit='b6af0887ee5f2e460202133791ad941a41f15beb', address_info={'node_ip_address': '172.20.112.29', 'raylet_ip_address': '172.20.112.29', 'redis_address': None, 'object_store_address': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462/sockets/raylet', 'webui_url': '', 'session_dir': '/tmp/ray/session_2023-02-23_12-10-52_222376_9462', 'metrics_export_port': 64329, 'gcs_address': '172.20.112.29:58132', 'address': '172.20.112.29:58132', 'dashboard_agent_listen_port': 52365, 'node_id': '54813d21ecddaa6145dd2be897c025a32da5ac1b4f3d987180631d33'})
Test Code
>>> asyncio.run(test_flow(3))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    asyncio.run(test_flow(3))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "<input>", line 4, in test_flow
    await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: Actor
    actor_id: 50b29daf8b55816bc011ba3401000000
    pid: 10853
    namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
    ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.
>>> asyncio.run(test_flow(3))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    asyncio.run(test_flow(3))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "<input>", line 4, in test_flow
    await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: Actor
    actor_id: e902d6517fd90dd991526d4201000000
    pid: 10915
    namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
    ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.
>>> asyncio.run(test_flow(1))
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    asyncio.run(test_flow(1))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "<input>", line 4, in test_flow
    await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: Actor
    actor_id: 59fa8e5c98e6d5aa14c8efb501000000
    pid: 10977
    namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
    ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: INTENDED_USER_EXIT Worker exit detail: Worker exits by an user request. exit_actor() is called.
>>> asyncio.run(test_flow(0))
2023-02-23 12:11:50,693 WARNING worker.py:1851 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa7b983ceec144a3ecdae053301000000 Worker ID: 23931a8a70fbcbc6acc411033656779b51dc5eeb6e58a31e629eb065 Node ID: 54813d21ecddaa6145dd2be897c025a32da5ac1b4f3d987180631d33 Worker IP address: 172.20.112.29 Worker port: 39009 Worker PID: 11039 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
--- Logging error ---
Exception in thread Exception in threading.excepthook:Exception ignored in thread started byException ignored in sys.unraisablehookTraceback (most recent call last):
  File "<input>", line 1, in <module>
    asyncio.run(test_flow(0))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "<input>", line 4, in test_flow
    await actor.wait_for_task.remote()
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
    class_name: Actor
    actor_id: a7b983ceec144a3ecdae053301000000
    pid: 11039
    namespace: abcb4eed-9a6d-4647-9e81-1ecd3bd7c1e5
    ip: 172.20.112.29
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
>>>
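To compare the two outcomes programmatically, the exit type can be pulled out of the error text. This is only a minimal sketch: parse_worker_exit_type is a hypothetical helper, not a Ray API, and it assumes the message keeps the "Worker exit type: ..." wording seen in the output above.

```python
import re
from typing import Optional

# Matches the exit type as it appears in the error text above,
# e.g. "Worker exit type: SYSTEM_ERROR". The format is an assumption
# based on the messages in this post, not a documented contract.
_EXIT_TYPE = re.compile(r"Worker exit type: (\w+)")


def parse_worker_exit_type(message: str) -> Optional[str]:
    """Return the exit type found in an error message, or None if absent."""
    match = _EXIT_TYPE.search(message)
    return match.group(1) if match else None


# Possible use on the caller side (left as a comment so the sketch
# runs without a Ray cluster):
#
#     try:
#         await actor.wait_for_task.remote()
#     except ray.exceptions.RayActorError as e:
#         exit_type = parse_worker_exit_type(str(e))
```

With this, the first three runs above would report INTENDED_USER_EXIT and the last one SYSTEM_ERROR.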