State API `list_tasks` stops working for long running jobs

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.43
  • Python version: 3.11
  • OS: Amazon Linux AMI
  • Cloud/Infrastructure: AWS
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: `list_tasks` with raise_on_missing_output=True keeps working for long-running jobs.
  • Actual: `list_tasks` with raise_on_missing_output=True raises for long-running jobs because the returned data is truncated.

We run training jobs, each consisting of 3 distinct tasks, in a Ray cluster. These jobs can run for up to 13-14 hours. When a job completes, we retrieve its task details and store them in external storage for later inspection and monitoring.

Unfortunately, we are running into problems with jobs that run for roughly 3 hours or longer: when we retrieve the task details of such a job, the `list_tasks` State API endpoint with raise_on_missing_output=True raises an exception due to data truncation:


(retrieved by ssh’ing into the head node of the ray cluster).

I suspect the cause of this may be twofold:

  1. It appears that filtering only kicks in after the data has been retrieved server-side, which makes it impossible to work around the problem with additional filters. I suspect this because a `list_tasks` call for a particular job can start failing when other jobs are executed on the same cluster, even with very specific filters applied to the query.
  2. For long-running jobs, the vast majority of the returned tasks are JobSupervisor.ping (see screenshot above), which we would ideally like to ignore. I don't see any use in storing these in the GCS or querying them later. Is there any way to ignore them, or to decrease the frequency with which they are spawned (see python/ray/dashboard/modules/job/job_manager.py at tag ray-2.43.0 in the ray-project/ray GitHub repository)?
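To make the two points above concrete, here is a minimal sketch of the workaround we have been considering. It queries with raise_on_missing_output=False and drops the ping tasks client-side. The helper names (`drop_ping_tasks`, `fetch_job_tasks`) and the `limit` value are my own and purely illustrative. I am also assuming the ping tasks show up under the name "JobSupervisor.ping" in the `name` or `func_or_class_name` field. Note that this does not fix the truncation itself, since the result may still be partial.

```python
from typing import Any, Iterable, List

# Assumption: this is how the ping tasks are named in our listings.
PING_TASK_NAME = "JobSupervisor.ping"


def _field(task: Any, key: str) -> Any:
    """Read a field from a task record, whether it is a dict or an object."""
    if isinstance(task, dict):
        return task.get(key)
    return getattr(task, key, None)


def drop_ping_tasks(tasks: Iterable[Any]) -> List[Any]:
    """Client-side filter: keep everything except JobSupervisor.ping tasks."""
    return [
        t
        for t in tasks
        if PING_TASK_NAME
        not in (_field(t, "name"), _field(t, "func_or_class_name"))
    ]


def fetch_job_tasks(job_id: str, limit: int = 10_000) -> List[Any]:
    """Fetch the tasks of one job, tolerating truncated output (hypothetical helper)."""
    # Imported lazily so drop_ping_tasks can be used without a Ray install.
    from ray.util.state import list_tasks

    tasks = list_tasks(
        filters=[("job_id", "=", job_id)],
        limit=limit,
        detail=True,
        # Accept a possibly partial result instead of an exception.
        raise_on_missing_output=False,
    )
    return drop_ping_tasks(tasks)
```

Because the filter appears to be applied only after the server has already collected (and truncated) the data, the tasks we actually care about can still be missing from the result, which is why this is not a real solution for us.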

I did some digging and I understand that Ray has hard limits on the number of tasks a State API query can return, because there is no pagination or gRPC streaming endpoint. I also understand that implementing either of these is a non-trivial amount of work.

Any help would be greatly appreciated!

As a new user I was only allowed to embed one image in my post, but I have another one that I think is relevant to demonstrate the quantity of JobSupervisor.ping tasks that are created.
This is a Grafana plot from the Ray Core dashboard where we have 3 jobs running: