Model training remains idle for 12 hrs!

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

We started training a model on a Ray cluster with 2 workers.
The model trained for several epochs, then suddenly went idle: it neither trains further nor stops, and the Ray cluster has now been idle for more than 12 hrs.
Kindly help!

We ran the model for 12 epochs; it completed the 7th epoch.
These are the logs after the 7th epoch:
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:29:22.262246641 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.677617816 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.706992479 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.736008839 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.762878088 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.176254101 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.264617500 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.295523115 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers

The last of these log lines was received 10 hrs ago.
The cluster and the model training process are still running.

Hey @siva14guru, do you mind sharing a reproducible script/the code that you are using?

Are you seeing that some epochs of training complete, but it then hangs after that? Or does training not even start at all?

Hey @amogkam, I'm not able to share the script right now. It ran for 7 epochs and stored its checkpoint; after that it hung.

Thanks for the info! Is this PyTorch, TensorFlow, or Horovod? Are you using the new Ray 2.0 Trainer APIs or the APIs from Ray 1.x? What is your cluster configuration? Are you training on GPUs across multiple nodes?

A script would be helpful if you are able to provide!

Framework: PyTorch 1.8.1
Ray: 1.12.1
Running the cluster with the KubeRay operator on Kubernetes
Head node: 2 CPUs and 6 GB RAM
2 workers: 8 CPUs and 32 GB RAM each
Training on CPU only
Training a Faster R-CNN model

I'm not able to share the script right now; I'll get it ready as soon as possible.
In the meantime, any insights on what might have gone wrong would be helpful.
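
Roughly, the training is launched like this (a simplified sketch using the Ray 1.x Train API, not the actual script from the repo; the dataset wiring and the config values are placeholders):

# Simplified sketch, not the actual training script: Ray 1.12 Train API,
# PyTorch on CPU, 2 workers. Data loading and the loss computation are elided.
import ray.train as train
import ray.train.torch  # provides train.torch.prepare_model and friends
from ray.train import Trainer
import torch
import torchvision


def train_func(config):
    # Faster R-CNN model; the real dataset and forward/backward loop go here.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=False)
    model = train.torch.prepare_model(model)   # moves to device, wraps in DDP
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

    for epoch in range(config["epochs"]):
        # ... iterate over the data loader and step the optimizer (elided) ...
        train.save_checkpoint(epoch=epoch)      # checkpoint after every epoch
        train.report(epoch=epoch)


trainer = Trainer(backend="torch", num_workers=2, use_gpu=False)
trainer.start()
trainer.run(train_func, config={"lr": 0.005, "epochs": 12})  # placeholder values
trainer.shutdown()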

Hey, whenever the workers' memory usage peaks, training becomes idle.

I ran it on two workers: one worker's memory consumption kept increasing until it peaked, while the other worker's memory consumption stayed roughly the same. I'm not sure why that happens. After reaching the peak, the cluster becomes idle.
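
For what it's worth, logging each worker's resident memory once per epoch from inside the training function should show which process keeps growing. A rough sketch (not from the actual code; it assumes psutil is installed in the worker image):

# Rough sketch: print each Ray Train worker's RSS once per epoch so the
# growing process can be identified. Assumes psutil is available on the workers.
import os
import psutil
import ray.train as train


def log_worker_memory(epoch):
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"[rank {train.world_rank()}] epoch {epoch}: RSS {rss_gb:.2f} GB")

Calling log_worker_memory(epoch) at the end of every epoch inside the training function would show whether a single rank keeps accumulating memory across epochs.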

Code to reproduce: GitHub - sivaguru-pat/detection-ray

If the cluster becomes idle after reaching peak memory usage, this could indicate that Ray is spilling objects to disk, which takes time and can lead to failures down the line if an unstable or slow filesystem (e.g. cloud storage) is used. The best thing here would be to ensure that your workers have enough memory for the model and training to fit in - can you just use larger instances?
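
One way to sanity-check this while training runs is to look at the cluster's resources from the head node (a quick sketch; the `ray memory` CLI additionally lists the objects currently held in the object store):

# Quick sketch: inspect the cluster's logical resources from the head node.
# These reflect what Ray has reserved for tasks/actors rather than OS-level
# usage; use `ray memory` on the CLI if you suspect the object store is
# filling up and spilling.
import ray

ray.init(address="auto")  # attach to the running cluster

total = ray.cluster_resources()
free = ray.available_resources()
for key in ("CPU", "memory", "object_store_memory"):
    print(f"{key}: total={total.get(key)}, available={free.get(key)}")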