A worker died or was killed while executing a task by an unexpected system error

  • High: It blocks me from completing my task.

Either one trial runs and then crashes, or no trials run at all and it crashes right away.

WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff50d7cfb6bf945b5b554fbab201000000 Worker ID: 54586409b231cd3d2422bf201389ac4af51180d94a4036a2822b5263 Node ID: 925c3b1cf0a080564fb69daa0c734e88e3d3fcdc95301734583409ba Worker IP address: 127.0.0.1 Worker port: 51829 Worker PID: 18920 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 10054. An existing connection was forcibly closed by the remote host. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.

The only resource I can see hitting 100% is GPU usage. I can run all of this in parallel using Ray Core, but it sometimes hangs there as well. I have plenty of memory, and this happens even though I'm only using a few of my 20 processors.

I have a few functions that get called behind the scenes, so perhaps I should be using Ray Train for the trainable function?

Any help will be much appreciated. I really like the idea of this platform.

Here is my setup:

import os

import ray
from ray import air, tune
from ray.tune.search import ConcurrencyLimiter
from ray.tune.search.optuna import OptunaSearch

if __name__ == '__main__':
    ray.shutdown()
    ray.init(_temp_dir=f"{os.getcwd()}/results/xg_tree", ignore_reinit_error=True)

    # Optuna search with a cap on the number of concurrent trials
    searcher = OptunaSearch(space=sample_params_ray, metric=eval_metric, mode="max")
    algo = ConcurrencyLimiter(searcher, max_concurrent=num_parallel)

    # Bind extra arguments and declare the per-trial resource request
    objective_func = tune.with_parameters(objective_ray, other_params=other_params, save_name=save_name)
    objective_resources = tune.with_resources(
        objective_func,
        resources={"gpu": 1, "cpu": num_parallel * 2, "memory": 3e10},
    )

    tuner = tune.Tuner(
        objective_resources,
        tune_config=tune.TuneConfig(search_alg=algo, num_samples=num_trials),
        # Set Ray Tune verbosity. Print the summary table only with levels 2 or 3.
        run_config=air.RunConfig(verbose=0),
    )
    results = tuner.fit()

Hi @sjmitche9,

It's hard to tell what's going on without more details. What happens if you restrict the degree of parallelism? How large is the data you're loading? How much RAM does your machine have?

Is there any more output? What does the log output from Ray Tune look like?

Also, can you share which version of Ray you’re using?

Hi @kai. Thanks for getting back to me so fast. I have it set to only 2 tasks or actors right now (sorry, I'm not familiar with the correct terminology yet). The data size doesn't seem to matter: I tried reducing it to something very small, and that didn't help. One thing I'm unsure of is whether I need to be loading the data with Ray or not. I read a few CSVs, do a bunch of calculations, and pass the result to an XGBoost classifier. I suspect it's crashing before the training function even gets called. I have 64 GB of RAM, and my Ray version is 2.4.0. There has been some odd output from time to time (I'll see if I can reproduce it). I'll post the dashboard agent and log monitor output below. Thank you for your help!

2023-04-30 23:22:44,843 INFO agent.py:142 – Dashboard agent grpc address: 127.0.0.1:63754
2023-04-30 23:22:44,843 INFO utils.py:112 – Get all modules by type: DashboardAgentModule
2023-04-30 23:22:45,581 INFO utils.py:145 – Available modules: [<class ‘ray.dashboard.modules.event.event_agent.EventAgent’>, <class ‘ray.dashboard.modules.healthz.healthz_agent.HealthzAgent’>, <class ‘ray.dashboard.modules.job.job_agent.JobAgent’>, <class ‘ray.dashboard.modules.log.log_agent.LogAgent’>, <class ‘ray.dashboard.modules.log.log_agent.LogAgentV1Grpc’>, <class ‘ray.dashboard.modules.reporter.reporter_agent.ReporterAgent’>, <class ‘ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent’>, <class ‘ray.dashboard.modules.serve.serve_agent.ServeAgent’>]
2023-04-30 23:22:45,581 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.event.event_agent.EventAgent’>
2023-04-30 23:22:45,581 INFO event_agent.py:38 – Event agent cache buffer size: 10240
2023-04-30 23:22:45,581 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.healthz.healthz_agent.HealthzAgent’>
2023-04-30 23:22:45,581 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.job.job_agent.JobAgent’>
2023-04-30 23:22:45,581 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.log.log_agent.LogAgent’>
2023-04-30 23:22:45,749 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.log.log_agent.LogAgentV1Grpc’>
2023-04-30 23:22:45,749 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.reporter.reporter_agent.ReporterAgent’>
2023-04-30 23:22:45,757 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.runtime_env.runtime_env_agent.RuntimeEnvAgent’>
2023-04-30 23:22:45,757 INFO agent.py:171 – Loading DashboardAgentModule: <class ‘ray.dashboard.modules.serve.serve_agent.ServeAgent’>
2023-04-30 23:22:45,757 INFO agent.py:176 – Loaded 8 modules.
2023-04-30 23:22:45,765 INFO http_server_agent.py:74 – Dashboard agent http address: 127.0.0.1:52365
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/local_raylet_healthz> → <function HealthzAgent.health_check at 0x0000021711F31790>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/local_raylet_healthz> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [POST] <PlainResource /api/job_agent/jobs/> → <function JobAgent.submit_job at 0x0000021711F77700>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/job_agent/jobs/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [POST] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <function JobAgent.stop_job at 0x0000021711FE7AF0>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/stop> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <function JobAgent.delete_job at 0x0000021711FE7CA0>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <function JobAgent.get_job_logs at 0x0000021711FE7E50>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <function JobAgent.tail_job_logs at 0x0000021711F70040>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <DynamicResource /api/job_agent/jobs/{job_or_submission_id}/logs/tail> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/ray/version> → <function ServeAgent.get_version at 0x0000021710D5CD30>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/ray/version> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/> → <function ServeAgent.get_all_deployments at 0x0000021710D5CDC0>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/applications/> → <function ServeAgent.get_serve_instance_details at 0x0000021710D5CF70>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/applications/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <PlainResource /api/serve/deployments/status> → <function ServeAgent.get_all_deployment_statuses at 0x0000021710D65160>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/status> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <PlainResource /api/serve/deployments/> → <function ServeAgent.delete_serve_application at 0x0000021710D65310>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [DELETE] <PlainResource /api/serve/applications/> → <function ServeAgent.delete_serve_applications at 0x0000021710D654C0>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/applications/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [PUT] <PlainResource /api/serve/deployments/> → <function ServeAgent.put_all_deployments at 0x0000021710D65670>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/deployments/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [PUT] <PlainResource /api/serve/applications/> → <function ServeAgent.put_all_applications at 0x0000021710D65820>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <PlainResource /api/serve/applications/> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [GET] <StaticResource /logs → WindowsPath(‘C:/Users/sjmit/My Drive/all_five/results/xg_tree/session_2023-04-30_23-22-40_078301_16736/logs’)> → <bound method StaticResource._handle of <StaticResource /logs → WindowsPath(‘C:/Users/sjmit/My Drive/all_five/results/xg_tree/session_2023-04-30_23-22-40_078301_16736/logs’)>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:81 – <ResourceRoute [OPTIONS] <StaticResource /logs → WindowsPath(‘C:/Users/sjmit/My Drive/all_five/results/xg_tree/session_2023-04-30_23-22-40_078301_16736/logs’)> → <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x00000217124AFA90>>
2023-04-30 23:22:45,765 INFO http_server_agent.py:82 – Registered 30 routes.
2023-04-30 23:22:45,773 INFO event_agent.py:56 – Report events to 127.0.0.1:63660
2023-04-30 23:22:45,773 INFO event_utils.py:132 – Monitor events logs modified after 1682920365.2050407 on c:\Users\sjmit\My Drive\all_five/results/xg_tree\session_2023-04-30_23-22-40_078301_16736\logs\events, the source types are all.
2023-04-30 23:22:44,957 INFO log_monitor.py:250 – Beginning to track file raylet.err
2023-04-30 23:22:44,957 INFO log_monitor.py:250 – Beginning to track file gcs_server.err
2023-04-30 23:22:44,957 INFO log_monitor.py:250 – Beginning to track file monitor.log
2023-04-30 23:22:45,299 INFO log_monitor.py:250 – Beginning to track file worker-e187655a57e1520883d43563a713152ed0254b17b62e05bc450625d3-ffffffff-20812.err
2023-04-30 23:22:45,299 INFO log_monitor.py:250 – Beginning to track file worker-e187655a57e1520883d43563a713152ed0254b17b62e05bc450625d3-ffffffff-20812.out
2023-04-30 23:22:45,299 INFO log_monitor.py:250 – Beginning to track file worker-fb91dbe86bdab91c41903cb5b31a42766ec79595016ff333ebd7ef26-ffffffff-21376.err
2023-04-30 23:22:45,299 INFO log_monitor.py:250 – Beginning to track file worker-fb91dbe86bdab91c41903cb5b31a42766ec79595016ff333ebd7ef26-ffffffff-21376.out
2023-04-30 23:22:45,409 INFO log_monitor.py:250 – Beginning to track file worker-26df0666720ffbfd0a13a4bf39e17f0b9eb039abdc31e86da2d291c3-ffffffff-16896.err
2023-04-30 23:22:45,409 INFO log_monitor.py:250 – Beginning to track file worker-26df0666720ffbfd0a13a4bf39e17f0b9eb039abdc31e86da2d291c3-ffffffff-16896.out
2023-04-30 23:22:45,409 INFO log_monitor.py:250 – Beginning to track file worker-f6b8cb7067af851301fd9c844aae740f0e19f66bac376d4f627572a6-ffffffff-7316.err
2023-04-30 23:22:45,409 INFO log_monitor.py:250 – Beginning to track file worker-f6b8cb7067af851301fd9c844aae740f0e19f66bac376d4f627572a6-ffffffff-7316.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-0c8b4cee032ffc9b99f4e5e737bd7cb9bd1056972a145a427c918d14-ffffffff-15132.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-0c8b4cee032ffc9b99f4e5e737bd7cb9bd1056972a145a427c918d14-ffffffff-15132.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-12de301358603ce5ae36d2a16709b84204cd509ee09338503036b1b1-ffffffff-17240.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-12de301358603ce5ae36d2a16709b84204cd509ee09338503036b1b1-ffffffff-17240.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-6f31a02af5ddbc61dc5fbb5b68eb5800c3adfec347d9fa191d2d8bde-ffffffff-10032.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-6f31a02af5ddbc61dc5fbb5b68eb5800c3adfec347d9fa191d2d8bde-ffffffff-10032.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-72d1d43b77d523b01e46334e0a7a2e6afd99b5cd20ed9e77e23f2971-ffffffff-15992.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-72d1d43b77d523b01e46334e0a7a2e6afd99b5cd20ed9e77e23f2971-ffffffff-15992.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-78db3e131f7dd763dcd4b16d61100ce84ef01773663158e4564eb859-ffffffff-20456.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-78db3e131f7dd763dcd4b16d61100ce84ef01773663158e4564eb859-ffffffff-20456.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-7cdf3646426dcc1f3b2db6599087af72b0e4bb55f9891e24f34840a0-ffffffff-6292.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-7cdf3646426dcc1f3b2db6599087af72b0e4bb55f9891e24f34840a0-ffffffff-6292.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-86306bcb8f6f522d411de0323358eff64861cd0c05656f6c93916caf-ffffffff-10488.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-86306bcb8f6f522d411de0323358eff64861cd0c05656f6c93916caf-ffffffff-10488.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-9049323602a7d2fc52d21682cdce0bc531dc5413c3850232357d7660-ffffffff-17888.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-9049323602a7d2fc52d21682cdce0bc531dc5413c3850232357d7660-ffffffff-17888.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-e3da6a896b55ba3d9f1c1ea1674c212cd7e272cca70699c26e72790d-ffffffff-404.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-e3da6a896b55ba3d9f1c1ea1674c212cd7e272cca70699c26e72790d-ffffffff-404.out
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-fb486449506d8a6b3fb71646c806ad9bd1337f099ead4adc2ccc50c3-ffffffff-388.err
2023-04-30 23:22:45,519 INFO log_monitor.py:250 – Beginning to track file worker-fb486449506d8a6b3fb71646c806ad9bd1337f099ead4adc2ccc50c3-ffffffff-388.out
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-a5c5bb7a14513d764a5686871007de55b355a95748190fb3852ba335-ffffffff-3756.err
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-a5c5bb7a14513d764a5686871007de55b355a95748190fb3852ba335-ffffffff-3756.out
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-bb09cf3012a5613649e2899b809f4d0e9c1f0d9d50a16d4a84e4eb3b-ffffffff-19328.err
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-bb09cf3012a5613649e2899b809f4d0e9c1f0d9d50a16d4a84e4eb3b-ffffffff-19328.out
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-c375ba43593542db87a45e110842ee1640b58f8b3a7e45ad03eb5043-ffffffff-14272.err
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-c375ba43593542db87a45e110842ee1640b58f8b3a7e45ad03eb5043-ffffffff-14272.out
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-f3158f4f82c981b3755a98717d900ab64e307f8ce9a49188a6952488-ffffffff-7836.err
2023-04-30 23:22:45,628 INFO log_monitor.py:250 – Beginning to track file worker-f3158f4f82c981b3755a98717d900ab64e307f8ce9a49188a6952488-ffffffff-7836.out
2023-04-30 23:22:45,749 INFO log_monitor.py:250 – Beginning to track file worker-b4c171be5f81deecc5034cb3a67ea8d5281aa44575e7d16c7978237f-ffffffff-16180.err
2023-04-30 23:22:45,749 INFO log_monitor.py:250 – Beginning to track file worker-b4c171be5f81deecc5034cb3a67ea8d5281aa44575e7d16c7978237f-ffffffff-16180.out
2023-04-30 23:22:45,749 INFO log_monitor.py:250 – Beginning to track file worker-f4bccf192011ddcd9da22641abe7ec349275e34d6364ac2e9ed034ef-ffffffff-15140.err
2023-04-30 23:22:45,749 INFO log_monitor.py:250 – Beginning to track file worker-f4bccf192011ddcd9da22641abe7ec349275e34d6364ac2e9ed034ef-ffffffff-15140.out

@kai Sorry if I've posted meaningless information; I figure more is better than less here. I removed the trainable function for this run, so all of this happens during my preprocessing, which really isn't much work. I suspect I'm violating a Ray rule or guideline somewhere. Do functions outside the objective function need the @ray.remote decorator? (A simplified sketch of what I mean is below.) Thanks again for your help!
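
To make that concrete, here is a simplified sketch of my structure (placeholder names and dummy data, not my real code): the helpers are plain functions with no decorators, and the objective just calls them.

import pandas as pd
from ray.air import session

# Plain helper with no @ray.remote decorator: in my real code this reads a few
# CSVs and does some calculations.
def load_and_preprocess():
    return pd.DataFrame({"x": range(100), "label": [i % 2 for i in range(100)]})

# The trainable passed to tune.with_parameters(); it calls the helper directly,
# again without any Ray decorators.
def objective_ray(config, other_params=None, save_name=None):
    data = load_and_preprocess()
    # ... fit an xgboost model on `data` using values from `config` here ...
    session.report({"accuracy": 0.5})  # dummy metric, just for the sketch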

2023-05-02 12:14:55,011 INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
[I 2023-05-02 12:14:57,957] A new study created in memory with name: optuna
== Status ==
Current time: 2023-05-02 12:15:02 (running for 00:00:04.99)
Using FIFO scheduling algorithm.
Logical resource usage: 4.0/20 CPUs, 1.0/1 GPUs (0.0/1.0 accelerator_type:G)
Result logdir: C:\Users\sjmit\ray_results\objective_ray_2023-05-02_12-14-57
Number of trials: 1/10 (1 RUNNING)


2023-05-02 12:15:03,258 WARNING worker.py:1986 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffb6ee79d7763d28e0c37afbab01000000 Worker ID: 59e2e494e42cd021c9954bdbf67254a3a2c5be4e2bc117172837a057 Node ID: 7c75d811b2a597e09b8539f0824480d13fa5286dc6364023172d884f Worker IP address: 127.0.0.1 Worker port: 52486 Worker PID: 21636 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.
2023-05-02 12:15:03,289 ERROR trial_runner.py:1450 -- Trial objective_ray_694ea3b0: Error happened when processing _ExecutorEventType.TRAINING_RESULT.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
  File "C:\Users\sjmit\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\tune\execution\ray_trial_executor.py", line 1231, in get_next_executor_event
    future_result = ray.get(ready_future)
  File "C:\Users\sjmit\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\sjmit\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\_private\worker.py", line 2523, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: ImplicitFunc
        actor_id: b6ee79d7763d28e0c37afbab01000000
        pid: 21636
        namespace: a6137a64-1c50-4369-a63b-f5207c4ca9da
        ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code None.

== Status ==
Current time: 2023-05-02 12:15:03 (running for 00:00:05.30)
Using FIFO scheduling algorithm.
Logical resource usage: 0/20 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:G)
Result logdir: C:\Users\sjmit\ray_results\objective_ray_2023-05-02_12-14-57
Number of trials: 2/10 (2 PENDING)
Number of errored trials: 1

Hi @kai,
Just wondering if the new information is helpful at all. Thank you!

Hi @sjmitche9 and apologies for the late reply.

Unfortunately, we can't see very much from the error message alone. Could you maybe share your objective_ray function?
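
One quick thing you could try in the meantime (just a sketch, assuming objective_ray is a plain function trainable): call it once outside of Ray with a fixed config, temporarily commenting out any session.report() call inside it. If it already fails or hangs there, the problem is independent of Tune.

# Hypothetical standalone check; objective_ray, other_params, and save_name are
# the objects from your own script, and the config values are placeholders for
# whatever a valid sample from your search space looks like.
if __name__ == "__main__":
    fixed_config = {"max_depth": 6, "learning_rate": 0.1}
    objective_ray(fixed_config, other_params=other_params, save_name=save_name)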

Just as a general comment, we don't test our ML libraries extensively on Windows, so even switching to WSL could lead to improvements here.