How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am using Ray Jobs to do some batch data processing. I found abnormal memory growth on the head node, specifically in the memory used by the GCS server: it grows continuously by about 8 GB per hour until OOM, and the job hangs once the abnormal memory growth begins.
This did not happen from the beginning of the job, nor in smaller jobs. It only happened after the job had run for more than 6 hours and executed more than 50,000 tasks.
Other related information:
- The head node and worker nodes are managed by Kubernetes.
- There are 400 worker nodes.
- Only one job was running when this happened.
- Both CPU and GPU tasks exist.
- Worker nodes went online and offline a few times because of OOM and disk pressure.
- log_to_driver is set to False.
- When GCS memory usage on the head node began to increase abnormally, it seems no new tasks were created, and the job itself appears to hang.
Can anybody give me some suggestions about this issue, or advice on how to debug it? Are there any logs I can collect?