Hi Ray experts,
We are running Ray 2.7.0 with an external Redis instance backing the GCS. We can submit jobs and view historical jobs on the Jobs page of the dashboard UI. However, when we simulate a head node failure by killing the head pod with “kubectl delete pod {head_pod_name}”, the Jobs page hangs after the head node restarts: it takes a very long time for the page to load the job list. We also found that the issue goes away if the Redis data is cleared with the “FLUSHALL” command.
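For anyone trying to reproduce this, here is a minimal sketch of how the delay could be measured outside the browser. It assumes the Jobs page is backed by the Ray Jobs REST endpoint "GET /api/jobs/" and that the dashboard is reachable at the address you pass in (for example via a port-forward to port 8265). The function name time_job_list and the 300-second timeout are only illustrative.

```python
import json
import time
import urllib.request


def time_job_list(base_url, fetch=urllib.request.urlopen, timeout=300):
    """Return (elapsed_seconds, job_count) for one request to the job list.

    base_url is the dashboard address, e.g. "http://127.0.0.1:8265"
    (an assumed port-forward; adjust for your cluster).
    fetch is injectable so the function can be exercised without a cluster.
    """
    start = time.monotonic()
    # Assumption: the dashboard Jobs page is served by this REST endpoint.
    with fetch(f"{base_url}/api/jobs/", timeout=timeout) as resp:
        jobs = json.loads(resp.read())
    return time.monotonic() - start, len(jobs)


# Example use against a live dashboard (not run here):
#   elapsed, count = time_job_list("http://127.0.0.1:8265")
#   print(f"listed {count} jobs in {elapsed:.1f}s")
```

Running this before killing the head pod, after the restart, and again after “FLUSHALL” should show whether the slowdown is in the job-listing call itself or only in the UI.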
We need help debugging this issue further.
Thanks!
Mingshi