Ray jobs failing after 250 jobs

Prasanth_Muddarsi · February 27, 2023, 5:00am

Hi,
I’m trying to submit jobs to an AWS Ray cluster with a t3.medium head node and 6 workers.
The jobs succeed as expected, but after 250 jobs, all subsequent jobs fail with the message:

Unexpected error occurred: The actor died unexpectedly before finishing the task.

After restarting the cluster, the next 250 jobs succeed again.

Each of my jobs creates an actor and calls 7 remote tasks, and the results are retrieved using ray.get().

Is there any way to resolve this?
