Hi all,
Due to other constraints I am forced to use Python 3.10.2, which means I have to use Ray 1.13 or newer. Previously I used Ray 1.10, which let me train in my setup for roughly a week before the GCS started causing problems.
Now I am using Ray 2.0.0 and have run into the following problem: after about 2 days, Ray seemed to start a new session.
After it started the new session, essentially all jobs stall and the head node gets spammed with messages like: "raylet has lagged heartbeat due to slow network or busy workload … node name ip has been marked dead because the server has missed too many heartbeats from it… ".
I also got one error of the form:
FileNotFoundError: No such file or directory: sessionspath/logs/worker-****.out
Any ideas what this is? Why does ray spontaneously start a new session?
When you say that Ray starts a new session, do you mean that the driver restarts or that the whole cluster restarts? It seems like this may be a system-level bug or a stress issue. Would you mind opening a GitHub issue so we can follow up with you there? It would also help if you could provide more information about your workload and how to reproduce the problem. Thanks!
@thoglu I think your GCS somehow gets restarted. The difference between 1.10 and the latest version is that we don’t use Redis by default. In this setup, everything is stored in memory.
Do you mind checking the session dir on the head node to see how many sessions there are and whether the GCS restarted? (There should be GCS logs in each session dir on the head node.)
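If it helps, here is a rough sketch of that check to run on the head node. It assumes the default temp directory `/tmp/ray`, session directories named `session_*`, and GCS log files under `logs/` whose names start with `gcs_server`; adjust these if your cluster uses a custom temp dir. The function name `summarize_sessions` is just something I made up for illustration, not part of Ray.

```python
from pathlib import Path


def summarize_sessions(ray_tmp="/tmp/ray"):
    """Return [(session_dir_name, [gcs_log_file_names])] for each session dir.

    Assumes the layout <ray_tmp>/session_*/logs/gcs_server*; this layout is
    an assumption and may differ on your cluster.
    """
    root = Path(ray_tmp)
    summary = []
    if not root.is_dir():
        return summary
    for session in sorted(root.glob("session_*")):
        # Skip symlinks (e.g. session_latest) so each session is counted once.
        if session.is_symlink() or not session.is_dir():
            continue
        logs_dir = session / "logs"
        logs = []
        if logs_dir.is_dir():
            logs = sorted(p.name for p in logs_dir.glob("gcs_server*"))
        summary.append((session.name, logs))
    return summary


if __name__ == "__main__":
    sessions = summarize_sessions()
    print(f"{len(sessions)} session dir(s) found")
    for name, logs in sessions:
        print(name, logs if logs else "no GCS logs found")
```

More than one session dir on the head node would suggest that Ray was started again on that machine; the GCS logs inside each dir are then the place to look for what happened around the restart.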