[Tune] Timesteps total gets reset every time 'num_healthy_workers' goes down

RalucaGeorgescu · December 30, 2020, 12:57pm

Hi!

Ray: v1.0.1
TensorFlow: 2.0
Python: 3.6.9
Ubuntu: 16.04
Head node VM: 64 cores, 504 GB memory

In my TensorBoard plots, every time a worker is blacklisted, the timesteps_total value gets reset, and I get zigzag plots against the number of training steps. Here is the step-based TB plot for num_healthy_workers:

Plotting training against wall-clock time, however, gives me linear plots, but you can see that training has been restored, so this affects training performance. Do you happen to know what might cause timesteps_total to reset every time num_healthy_workers decreases?

Below is a list showing how timesteps_total gets reset between training iterations:
Training iteration: 1, Timesteps total: 12000
Training iteration: 2, Timesteps total: 24000
Training iteration: 3, Timesteps total: 36000
Training iteration: 4, Timesteps total: 48000
Training iteration: 5, Timesteps total: 60000
Training iteration: 6, Timesteps total: 72000
Training iteration: 7, Timesteps total: 84000
Training iteration: 8, Timesteps total: 96000
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
Training iteration: 17, Timesteps total: 34200
Training iteration: 18, Timesteps total: 45600
Training iteration: 19, Timesteps total: 57000
Training iteration: 20, Timesteps total: 68400
Training iteration: 21, Timesteps total: 11200
Training iteration: 22, Timesteps total: 22400
Training iteration: 23, Timesteps total: 33600
Training iteration: 24, Timesteps total: 44800
Training iteration: 25, Timesteps total: 56000
Training iteration: 26, Timesteps total: 67200
Training iteration: 27, Timesteps total: 78400
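To make the pattern in the list above easier to see, here is a minimal Python sketch that finds the iterations where timesteps_total drops and rebuilds a running total from the per-iteration increments. The helper names (find_resets, per_iteration_steps) are hypothetical, and the sketch assumes the counter restarts from zero at each reset, which the numbers above suggest but do not prove.

```python
# timesteps_total values copied from the list above (iterations 1..27).
totals = [
    12000, 24000, 36000, 48000, 60000, 72000, 84000, 96000, 108000, 120000,
    11600, 23200, 34800, 46400,
    11400, 22800, 34200, 45600, 57000, 68400,
    11200, 22400, 33600, 44800, 56000, 67200, 78400,
]


def find_resets(values):
    """Return the 1-based iteration numbers at which the counter went down."""
    return [i + 1 for i in range(1, len(values)) if values[i] < values[i - 1]]


def per_iteration_steps(values):
    """Return the steps collected in each iteration.

    Assumption: when the counter drops, it restarted from zero, so the
    reported value itself is the number of steps for that iteration.
    """
    steps = []
    prev = 0
    for value in values:
        steps.append(value if value < prev else value - prev)
        prev = value
    return steps


resets = find_resets(totals)
steps = per_iteration_steps(totals)
print("Resets at iterations:", resets)
print("Steps per iteration after each reset:", [steps[i - 1] for i in resets])
print("Steps actually collected:", sum(steps))
```

Under that assumption, the resets fall at iterations 11, 15, and 21, and the per-iteration batch shrinks from 12000 to 11600, 11400, and 11200 steps, which would be consistent with workers being dropped each time.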

Many thanks!

rliaw · January 5, 2021, 5:15pm

Hi @RalucaGeorgescu, could you try enabling checkpointing with Tune? You can do so with tune.run(checkpoint_freq=N).
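As a rough illustration of that suggestion, the sketch below passes checkpoint_freq to tune.run along with two related arguments, checkpoint_at_end and max_failures. The trainable name "PPO", the "CartPole-v0" environment, the worker count, and every numeric value are placeholders chosen for the example, not details from the original post, and whether checkpointing stops the counter from resetting is something to verify rather than assume.

```python
# Sketch only: all values here are illustrative placeholders.
TUNE_KWARGS = dict(
    stop={"training_iteration": 100},
    checkpoint_freq=10,      # save a checkpoint every 10 training iterations
    checkpoint_at_end=True,  # also save one when the trial finishes
    max_failures=3,          # allow Tune to retry a failed trial up to 3 times
)


def launch():
    """Start the experiment with checkpointing enabled."""
    # Imported lazily so the settings above can be inspected without Ray installed.
    import ray
    from ray import tune

    ray.init()
    return tune.run(
        "PPO",  # placeholder algorithm; use the one from your own setup
        config={"env": "CartPole-v0", "num_workers": 4},  # placeholder config
        **TUNE_KWARGS,
    )


# Call launch() from your own script to start training.
```

With max_failures set, Tune retries a failed trial and resumes from the latest checkpoint if one exists, so a non-zero checkpoint_freq gives it something recent to resume from.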

Let me know if that helps!

rliaw · January 5, 2021, 5:16pm

Also, are there other logs that show why the number of healthy workers decreases?
