[Tune] Timesteps total gets reset every time ‘num_healthy_workers’ goes down

Hi!

Ray: v1.0.1
TensorFlow: 2.0
Python: 3.6.9
Ubuntu: 16.04
Head node VM: 64 cores, 504 GB memory

In my TensorBoard plots, every time a worker is blacklisted, the timesteps_total value gets reset, and I get zigzag plots against the number of training steps. Here is the step-based TensorBoard plot for num_healthy_workers:
[image]

Plotting training against wall-clock time, however, gives me linear plots, but you can see that training has been restored, so it affects training performance. Do you know what might cause timesteps_total to reset every time num_healthy_workers decreases?

Below is a list showing how timesteps_total gets reset between training iterations:
Training iteration: 1, Timesteps total: 12000
Training iteration: 2, Timesteps total: 24000
Training iteration: 3, Timesteps total: 36000
Training iteration: 4, Timesteps total: 48000
Training iteration: 5, Timesteps total: 60000
Training iteration: 6, Timesteps total: 72000
Training iteration: 7, Timesteps total: 84000
Training iteration: 8, Timesteps total: 96000
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
Training iteration: 17, Timesteps total: 34200
Training iteration: 18, Timesteps total: 45600
Training iteration: 19, Timesteps total: 57000
Training iteration: 20, Timesteps total: 68400
Training iteration: 21, Timesteps total: 11200
Training iteration: 22, Timesteps total: 22400
Training iteration: 23, Timesteps total: 33600
Training iteration: 24, Timesteps total: 44800
Training iteration: 25, Timesteps total: 56000
Training iteration: 26, Timesteps total: 67200
Training iteration: 27, Timesteps total: 78400
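
For anyone who wants to locate the resets in a longer log, here is a minimal sketch that flags every iteration where timesteps_total drops. It assumes log lines in exactly the format above; the names `LOG`, `LINE_RE`, and `find_resets` are made up for illustration, and the sample is only an excerpt of the list.

```python
import re

# Excerpt of the log above (iterations 9 to 16).
LOG = """\
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
"""

LINE_RE = re.compile(r"Training iteration: (\d+), Timesteps total: (\d+)")


def find_resets(log_text):
    """Return (iteration, previous_total, new_total) for each drop in the counter."""
    rows = [(int(it), int(total)) for it, total in LINE_RE.findall(log_text)]
    resets = []
    # Compare each row with the one before it; a smaller total means a reset.
    for (_, prev_total), (it, total) in zip(rows, rows[1:]):
        if total < prev_total:
            resets.append((it, prev_total, total))
    return resets


resets = find_resets(LOG)
for it, prev_total, total in resets:
    print(f"reset at iteration {it}: {prev_total} -> {total}")
```

In the excerpt, the value right after each reset equals one iteration's worth of timesteps (for example, 23200 - 11600 = 11600), so the counter appears to restart from zero rather than lose a fixed amount.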

Many thanks!

Hi @RalucaGeorgescu, could you try enabling checkpointing with Tune? You can do so with tune.run(..., checkpoint_freq=N).
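
As a rough sketch of what that could look like: the algorithm ("PPO"), the environment ("CartPole-v0"), the worker count, the stopping criterion, and the value 5 are all placeholders I am assuming here, not details from your setup, and `TUNE_KWARGS` and `launch` are illustrative names.

```python
# Checkpoint settings forwarded to tune.run; 5 is an arbitrary example value.
TUNE_KWARGS = {"checkpoint_freq": 5}  # save a checkpoint every 5 training iterations


def launch():
    """Start a short RLlib run with periodic checkpoints (needs ray[rllib] installed)."""
    import ray
    from ray import tune

    ray.init()
    return tune.run(
        "PPO",  # placeholder algorithm, swap in your own trainable
        config={"env": "CartPole-v0", "num_workers": 2},  # placeholder env and workers
        stop={"training_iteration": 20},  # placeholder stopping criterion
        **TUNE_KWARGS,
    )
```

Calling `launch()` starts the run; replace the placeholders with your actual trainable and config.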

Let me know if that helps!

Also, are there other logs that show why the number of healthy workers decreases?