Model training remains idle for 12 hrs!

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

We started training a model on a Ray cluster with 2 workers.
The model trained for several epochs, then suddenly became idle: it is neither training further nor stopping. The Ray cluster has also been idle for more than 12 hrs.
Kindly help!

We ran the model for 12 epochs; it completed the 7th epoch.
These are the logs after the 7th epoch:
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:29:22.262246641 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.677617816 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.706992479 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.736008839 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=107, ip=10.0.5.91) E0823 01:37:50.762878088 139 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.176254101 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.264617500 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers
(BaseWorkerMixin pid=78, ip=10.0.5.179) E0823 01:38:25.295523115 110 fork_posix.cc:76] Other threads are currently calling into gRPC, skipping fork() handlers

The last of these was received 10 hrs ago.
The cluster and the training job are still running.

Hey @siva14guru, do you mind sharing a reproducible script/the code that you are using?

Are you seeing that some epochs of training complete but it then hangs after that? Or does training not even start at all?

Hey @amogkam, I'm not able to share the script right now. It ran for 7 epochs and stored its checkpoint; after that it hung.

Thanks for the info! Is this PyTorch, TensorFlow, or Horovod? Are you using the new Ray 2.0 Trainer APIs or the APIs from Ray 1.x? What is your cluster configuration? Are you training on GPUs across multiple nodes?

A script would be helpful if you are able to provide one!

Framework: PyTorch 1.8.1
Ray: 1.12.1
Running the cluster with the KubeRay operator on Kubernetes
Head node: 2 CPUs and 6 GB RAM
Workers: 2 workers with 8 CPUs and 32 GB RAM each
Training on CPU only
Training a Faster R-CNN model

I'm not able to share the script right now; I'll get it ready as soon as possible.
If you can share your insights on what might have gone wrong in the meantime, that would be helpful.
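
For context, the setup roughly follows the standard Ray 1.x Train pattern. A minimal sketch with a placeholder model and dataset (not the actual training code) looks like this:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

from ray import train
from ray.train import Trainer
from ray.train.torch import prepare_model, prepare_data_loader


def train_func(config):
    # Placeholder model and data; the real job trains a Faster R-CNN on CPU.
    model = prepare_model(nn.Linear(10, 2))
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = prepare_data_loader(DataLoader(dataset, batch_size=8))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(config["epochs"]):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
        # Checkpoint and report after every epoch, as in the real run.
        train.save_checkpoint(epoch=epoch, model_state=model.state_dict())
        train.report(epoch=epoch, loss=loss.item())


# Two CPU-only Train workers, matching the cluster described above.
trainer = Trainer(backend="torch", num_workers=2, use_gpu=False)
trainer.start()
trainer.run(train_func, config={"epochs": 12})
trainer.shutdown()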

Hey, whenever a worker's memory usage peaks, training becomes idle.

Running on two workers, one worker's memory consumption kept increasing until it peaked, while the other worker's memory consumption stayed almost the same. I'm not sure why that is happening. After reaching the peak, the cluster becomes idle.
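
To pinpoint which worker is growing and by how much per epoch, something like the following can be called at the end of each epoch inside the training function (a sketch; it assumes psutil is installed on the workers):

import os
import psutil
from ray import train

def log_worker_memory(epoch):
    # Print this worker's resident memory; Ray prefixes each driver log line
    # with the worker's pid and ip, so the growing worker is easy to identify.
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    print(f"epoch {epoch}: rank {train.world_rank()} RSS = {rss_gb:.2f} GB")

Calling log_worker_memory(epoch) next to the checkpoint call prints each worker's per-epoch RSS in the driver output.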

Code to reproduce: GitHub - sivaguru-pat/detection-ray

If the cluster becomes idle after reaching peak memory usage, this could indicate that Ray is spilling objects to disk, which takes time and can lead to failures down the line if an unstable or slow filesystem (e.g. cloud storage) is used. The best thing here would be to ensure that your workers have enough memory for the model and training to fit in. Can you just use larger instances?
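
If some spilling is unavoidable, it is also worth making sure objects spill to fast local disk rather than a slow or remote filesystem. A minimal sketch of the filesystem spilling configuration (the directory path is just an example, and on a KubeRay cluster this would need to be applied where Ray is started on the head node, not in the training script):

import json
import ray

# Sketch only: tell Ray to spill objects to a local directory when the object
# store fills up; /tmp/ray_spill is a placeholder path.
ray.init(
    _system_config={
        "object_spilling_config": json.dumps(
            {"type": "filesystem", "params": {"directory_path": "/tmp/ray_spill"}}
        )
    }
)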