Head and worker nodes die after a few seconds

Hi! I am following the documentation Getting Started — Ray 3.0.0.dev0 to set up a Ray cluster on my local machine using Docker.

I followed all the steps and could see both the head and worker pods running. However, on the dashboard both nodes turn DEAD after just a few seconds.
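For reference, these are roughly the commands I ran while following the guide (the chart and release names are what I recall from the docs, so they may differ slightly from the current version):

# Install the KubeRay operator via Helm
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
helm install kuberay-operator kuberay/kuberay-operator

# Deploy the sample RayCluster CR, unchanged
helm install raycluster kuberay/ray-cluster

# Confirm the operator, head, and worker pods come up
kubectl get pods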

(Screenshot: kubectl get pods output showing the head and worker pods running)

(Screenshot: Ray dashboard showing both nodes as DEAD)

I’m using the sample RayCluster CR recommended in the documentation, without any changes.

Below are the log files picked up from the head node's /tmp/ray/session_latest/logs directory.

raylet.err
[2023-03-23 22:01:27,704 E 229 336] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when

  • The version of grpcio doesn’t follow Ray’s requirement. Agent can segfault with the incorrect grpcio version. Check the grpcio version pip freeze | grep grpcio.
  • The agent failed to start because of unexpected error or port conflict. Read the log cat /tmp/ray/session_latest/dashboard_agent.log. You can find the log file structure here Logging — Ray 3.0.0.dev0.
  • The agent is killed by the OS (e.g., out of memory).
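Following those suggestions, this is roughly how I checked the grpcio version and pulled the agent log from inside the head pod (the pod name below is just a placeholder for mine):

# Open a shell in the head pod (replace the placeholder with your actual head pod name)
kubectl exec -it <raycluster-head-pod> -- /bin/bash

# Inside the pod: check the installed grpcio version
pip freeze | grep grpcio

# Inside the pod: read the dashboard agent log mentioned in the error
cat /tmp/ray/session_latest/logs/dashboard_agent.log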

dashboard_agent.log
2023-03-23 22:01:25,093 INFO agent.py:105 – Parent pid is 229
2023-03-23 22:01:25,147 INFO agent.py:131 – Dashboard agent grpc address: 0.0.0.0:57134
2023-03-23 22:01:25,156 WARNING agent.py:197 – Raylet is considered dead 1 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-03-23 22:01:25,161 INFO utils.py:112 – Get all modules by type: DashboardAgentModule
2023-03-23 22:01:26,209 INFO utils.py:145 – Available modules: [<class ‘ray.dashboard.modules.event.event_agent.EventAgent’>, <class ‘ray.dashboard.modules.healthz.healthz_agent.HealthzAgent’>, <class ‘ray.dashboard.modules.job.job_agent.JobAgent’>, <class ‘ray.dashboard.modules.log.log_agent.LogAgent’>, <class ‘ray.dashboard.modules.log.log_agent.LogAgentV1Grpc’>, <class ‘ray.dashboard.modules.reporter.reporter_agent.ReporterAgent’>, <class ‘ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent’>, <class ‘ray.dashboard.modules.serve.serve_agent.ServeAgent’>]
2023-03-23 22:01:26,210 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.event.event_agent.EventAgent’>
2023-03-23 22:01:26,211 INFO event_agent.py:38 – Event agent cache buffer size: 10240
2023-03-23 22:01:26,212 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.healthz.healthz_agent.HealthzAgent’>
2023-03-23 22:01:26,212 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.job.job_agent.JobAgent’>
2023-03-23 22:01:26,212 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.log.log_agent.LogAgent’>
2023-03-23 22:01:26,212 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.log.log_agent.LogAgentV1Grpc’>
2023-03-23 22:01:26,213 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.reporter.reporter_agent.ReporterAgent’>
2023-03-23 22:01:26,227 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent’>
2023-03-23 22:01:26,229 INFO agent.py:161 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.serve.serve_agent.ServeAgent’>
2023-03-23 22:01:26,229 INFO agent.py:165 – Loaded 8 modules.
2023-03-23 22:01:26,245 WARNING agent.py:197 – Raylet is considered dead 2 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-03-23 22:01:26,246 INFO http_server_agent.py:75 – Dashboard agent http address: 0.0.0.0:52365
2023-03-23 22:01:26,246 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/local_raylet_healthz> → <function HealthzAgent.health_check at 0x408cc92830>
2023-03-23 22:01:26,247 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/local_raylet_healthz> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,247 INFO http_server_agent.py:81 – <ResourceRoute [POST] <PlainResource /api/job_agent/jobs/> → <function JobAgent.submit_job at 0x408ccd2200>
2023-03-23 22:01:26,247 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/job_agent/jobs/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,247 INFO http_server_agent.py:81 – <ResourceRoute [POST] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <function JobAgent.stop_job at 0x408ccd2b00>
2023-03-23 22:01:26,248 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,248 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <function JobAgent.delete_job at 0x408ccda3b0>
2023-03-23 22:01:26,248 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,248 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <function JobAgent.get_job_logs at 0x408ccda560>
2023-03-23 22:01:26,249 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,249 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <function JobAgent.tail_job_logs at 0x408ccda710>
2023-03-23 22:01:26,249 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,249 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/ray/version> → <function ServeAgent.get_version at 0x408d3f0170>
2023-03-23 22:01:26,250 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/ray/version> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,250 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/> → <function ServeAgent.get_all_deployments at 0x408d3f0200>
2023-03-23 22:01:26,250 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,250 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/status> → <function ServeAgent.get_all_deployment_statuses at 0x408d3f03b0>
2023-03-23 22:01:26,251 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/status> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,251 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <PlainResource /api/serve/deployments/> → <function ServeAgent.delete_serve_application at 0x408d3f0560>
2023-03-23 22:01:26,251 INFO http_server_agent.py:81 – <ResourceRoute [PUT] <PlainResource /api/serve/deployments/> → <function ServeAgent.put_all_deployments at 0x408d3f0710>
2023-03-23 22:01:26,251 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,252 INFO http_server_agent.py:81 – <ResourceRoute [GET] <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-03-23_22-01-12_707254_14/logs’)> → <bound method StaticResource._handle of <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-03-23_22-01-12_707254_14/logs’)>>
2023-03-23 22:01:26,252 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <StaticResource /logs → PosixPath(‘/tmp/ray/session_2023-03-23_22-01-12_707254_14/logs’)> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x408d458d90>>
2023-03-23 22:01:26,252 INFO http_server_agent.py:82 – Registered 23 routes.
2023-03-23 22:01:26,324 INFO event_agent.py:56 – Report events to 10.244.0.6:34309
2023-03-23 22:01:26,327 INFO event_utils.py:136 – Monitor events logs modified after 1679632285.8408732 on /tmp/ray/session_2023-03-23_22-01-12_707254_14/logs/events, the source types are all.
2023-03-23 22:01:26,649 WARNING agent.py:197 – Raylet is considered dead 3 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-03-23 22:01:27,052 WARNING agent.py:197 – Raylet is considered dead 4 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-03-23 22:01:27,458 WARNING agent.py:197 – Raylet is considered dead 5 X. If it reaches to 5, the agent will kill itself. Parent: None, parent_gone: True, init_assigned_for_parent: False, parent_changed: False.
2023-03-23 22:01:27,459 ERROR agent.py:249 – Raylet is terminated: ip=10.244.0.6, id=1c909ca6360bf4cabdbb6fc9f7e5192657f6263fef071fc66ccf892c. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
[state-dump] UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (1 active), CPU time: mean = 1.593 ms, total = 3.185 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 67.051 ms, total = 67.051 ms
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 1.311 s, total = 1.311 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 272.750 us, total = 272.750 us
[state-dump] NodeManagerService.grpc_server.GetResourceLoad - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2023-03-23 22:01:21,187 I 229 229] (raylet) accessor.cc:590: Received notification for node id = 1c909ca6360bf4cabdbb6fc9f7e5192657f6263fef071fc66ccf892c, IsAlive = 1
[2023-03-23 22:01:23,106 I 229 229] (raylet) accessor.cc:590: Received notification for node id = cc23449b1f256d08bd9714766584bb34c9cab61d1fc0352ee59cd884, IsAlive = 1
[2023-03-23 22:01:26,308 I 229 229] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 10.244.0.6, port: 57134, id: 424238335

I didn’t find anything worthwhile in gcs_server.out

Appreciate any help on this!

@VD23 Just curious: why are you using Docker on a local machine?

Can’t you follow the getting started instructions here?

What kind of machine is it?

Thanks for your response @Jules_Damji. I use a Mac with an M1 chip. I did test the local Ray cluster setup and was able to run a sample job. But I'd eventually want to run Ray jobs on EKS, so I wanted to test on my local Kubernetes cluster first.

Are there any known issues with running a Ray cluster on Docker?
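One thing I still plan to rule out on my end, given the M1, is whether the Ray image pinned in the sample CR actually has an arm64 variant, or whether the pods are pulling an amd64 image and running under emulation. This is just my own guess at a next step, not something from the docs (substitute whatever image tag your CR actually pins):

# Check the architecture of the Kubernetes node(s) backing the local cluster
kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'

# Check which platforms the Ray image provides
docker manifest inspect rayproject/ray:2.3.0 | grep -i architecture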

@VD23 Not that I know of. Many Ray users use Ray on K8s.
Managed Kubernetes services — Ray 2.3.0.

Give it a try on AWS if you have access to it, and let us know of any issues.