State API `list_tasks` stops working for long running jobs

Jonas_Verschueren · May 9, 2025, 9:39am

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

Ray version: 2.43
Python version: 3.11
OS: linux amazon ami
Cloud/Infrastructure: AWS
Other libs/tools (if relevant):

3. What happened vs. what you expected:

Expected: The list_tasks API method keeps working for long running tasks with raise_on_missing_output=True
Actual: The list_tasks method does not work for long running tasks with raise_on_missing_output=True due to data truncation.

We run training jobs consisting of 3 distinct tasks in a ray cluster. These jobs can run for up to 13-14 hours. Upon completion of the job we retrieve and store the task details of the job in external storage for later inspection and monitoring purposes.

Unfortunately, we are experiencing problems with jobs that run for around 3 hours or more, because the list_tasks State API endpoint with raise_on_missing_output=True raises exceptions due to data truncation when we retrieve the task details of a particular job:

(retrieved by ssh’ing into the head node of the ray cluster).
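
For readers without the screenshot, the call pattern described above looks roughly like the sketch below. It assumes the `ray.util.state.list_tasks` API; the helper name `list_job_tasks`, the `list_fn` parameter, and the limit value are illustrative assumptions, not taken from the original post.

```python
def list_job_tasks(job_id, list_fn=None, limit=10_000):
    """Fetch task details for one job, failing loudly if the result is truncated."""
    if list_fn is None:
        # Imported lazily so the sketch can be exercised without a running cluster.
        from ray.util.state import list_tasks as list_fn
    return list_fn(
        filters=[("job_id", "=", job_id)],  # restrict the query to a single job
        limit=limit,                        # assumed value; the default is much smaller
        detail=True,                        # ask for full task details
        raise_on_missing_output=True,       # raise instead of returning partial data
    )
```

With `raise_on_missing_output=True`, a truncated result surfaces as an exception rather than as a silently incomplete list, which is the failure reported here.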

I suspect the cause of this may be twofold:

1. It appears the filtering mechanism only kicks in after the necessary data has been retrieved server-side, which makes it impossible to work around our problem using additional filters. I suspect this is the case because a list_tasks for a particular job can start failing when other jobs are executed on the same cluster, even with very specific filters applied to the query.
2. For long-running jobs, the vast majority of tasks that are returned are JobSupervisor.ping (see screenshot above), which ideally we’d like to ignore. I don’t see any use for us storing these in the GCS or querying them later. Is there any way to ignore these or decrease the frequency with which these tasks are spawned (ray/python/ray/dashboard/modules/job/job_manager.py at ray-2.43.0 · ray-project/ray · GitHub)?
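
As a partial stopgap, the ping tasks can at least be dropped client-side before the details are written to external storage. The sketch below is an assumption-laden illustration: `drop_ping_tasks` is a hypothetical helper, and it accepts either dict records or objects with a `name` attribute because the record shape may differ between Ray versions. It does not avoid the truncation itself, since it only runs on whatever the query managed to return.

```python
def drop_ping_tasks(tasks, ignored_names=("JobSupervisor.ping",)):
    """Return the task records whose name is not in ignored_names."""
    def name_of(task):
        # Support both dict-style records and dataclass-style records.
        if isinstance(task, dict):
            return task.get("name")
        return getattr(task, "name", None)

    return [task for task in tasks if name_of(task) not in ignored_names]
```

This would typically be applied to a result fetched with `raise_on_missing_output=False`, in which case the stored data may still be incomplete.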

I did some digging and I understand there are hard limits in ray around the number of tasks that can be returned by a State API query due to there not being any pagination or gRPC streaming endpoints. I also understand that implementing either of these is a non-trivial amount of work.

Any help would be greatly appreciated!

Jonas_Verschueren · May 9, 2025, 9:41am

As a new user I was only allowed to embed one image in my post but I have another one that I think is relevant to demonstrate the quantity of JobSupervisor.ping tasks that are created.
This is a Grafana plot from the Ray Core dashboard, where we have 3 jobs running:

israbbani · May 15, 2025, 4:30pm

Hey Jonas, we’ve created a PR to address the JobSupervisor.ping tasks. It should be available in the next release.

Jonas_Verschueren · May 15, 2025, 5:03pm

Great! Thank you for your efforts.

Is there an estimated timeline yet for when the next ray release might happen?

israbbani · May 15, 2025, 5:06pm

There isn’t a set timeline afaik b/c there are a lot of moving parts, but the releases page should give you a rough sense of how often it happens. Usually 1-2 times a month.
