Hi @pratkpranav,
Looking at the logs, it seems the agent process crashed and brought down the raylet process.
raylet.out:
[2022-10-05 10:30:47,880 W 288188 288227] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, return value 0. ip 172.29.58.148. id 424238335
[2022-10-05 10:30:47,880 E 288188 288227] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.
[2022-10-05 10:30:47,880 D 288188 288227] (raylet) logging.cc:323: Uninstall signal handlers.
agent.log:
2022-10-05 08:11:31,109 INFO runtime_env_agent.py:410 -- Runtime env already created successfully. Env: {"env_vars": {"OMP_NUM_THREADS": "4"}}, context: {"command_prefix": [], "env_vars": {"OMP_NUM_THREADS": "4"}, "py_executable": "/usr/bin/python3", "resources_dir": null, "container": {}, "java_jars": []}
2022-10-05 10:30:47,443 ERROR agent.py:217 -- Raylet is terminated: ip=172.29.58.148, id=d2a5c564a71c468ddacdab4a7f8e1c29c69b626acc3d5f1457731b28. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
...
The sequence is odd, though: the agent first detected that the raylet was dead at 10:30:47,443. Interestingly, the raylet.out log from the node that crashed starts at 10:30:47,818, which is after the agent had already detected the raylet failure.
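In case it helps when comparing further logs, here is a rough sketch of how the two files could be interleaved by timestamp to check the ordering. It is not a Ray utility: `parse_ts` and `merge_logs` are names made up for this example, and it assumes every line of interest starts with a timestamp in one of the two formats quoted above (with or without a leading `[`). Lines without a timestamp, such as continuation lines, are skipped.

```python
import re
from datetime import datetime

# Leading timestamp in both formats quoted above (assumed, based on the excerpts):
#   raylet.out: "[2022-10-05 10:30:47,880 W 288188 288227] (raylet) ..."
#   agent log:  "2022-10-05 10:30:47,443 ERROR agent.py:217 -- ..."
TS_RE = re.compile(r"^\[?(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    match = TS_RE.match(line)
    if match is None:
        return None
    # %f accepts the 3-digit millisecond field and scales it to microseconds.
    return datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")


def merge_logs(sources):
    """Merge log lines from several sources into one list sorted by time.

    sources: dict mapping a label (e.g. "raylet") to an iterable of lines.
    Returns a list of (timestamp, label, line) tuples.
    """
    events = []
    for label, lines in sources.items():
        for line in lines:
            ts = parse_ts(line)
            if ts is not None:
                events.append((ts, label, line.rstrip("\n")))
    # Stable sort: lines with equal timestamps keep their original order.
    events.sort(key=lambda event: event[0])
    return events
```

To use it, pass open file objects for raylet.out and the agent log as the dict values and print the merged tuples; whichever source appears first around 10:30:47 is the one that reported the failure first.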
Do you have more logs, or other raylet-related files, from the node where the raylet failed?