Job API is very slow when using external Redis

Hi Ray experts,

We are running Ray 2.7.0 with an external Redis instance for the GCS. We can submit jobs and view historical jobs on the job page of the dashboard UI. However, when we simulate a head node failure by deleting the head pod ("kubectl delete pod {head_pod_name}"), the job page hangs after the head node restarts. It takes a very long time for the page to load the job list. We also found that the issue goes away if the Redis data is cleared with the "FLUSHALL" command.

We need help debugging this issue further.
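
In case it helps anyone reproduce this, here is a minimal sketch for timing the job list request outside the browser. It assumes the dashboard is reachable at http://127.0.0.1:8265 (for example through a port-forward) and that the job list is served at GET /api/jobs/, which is my understanding of the Jobs REST API; please check both against your own setup. The helper names time_call and list_jobs are just illustrative, not part of Ray.

```python
import json
import time
import urllib.request


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


def list_jobs(dashboard_url="http://127.0.0.1:8265", timeout_s=300):
    """Fetch the job list from the dashboard.

    Assumptions: the dashboard URL and the /api/jobs/ path are not taken
    from this thread; adjust them to match your deployment.
    """
    url = f"{dashboard_url}/api/jobs/"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.loads(resp.read())
```

Running time_call(list_jobs) once before deleting the head pod and once after it restarts should give two comparable numbers, which makes it easier to tell whether the delay is in the Job API itself or only in the dashboard page.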


This seems like a bug? cc: @Kai-Hsun_Chen @sangcho

Other users have seen this issue as well (see this Slack thread), so it does seem like a bug. It would be great if someone could offer help here.

Thanks again!

I will reply in the Slack thread.