Hi!
Ray: v1.0.1
TensorFlow: 2.0
Python: 3.6.9
Ubuntu: 16.04
Head node VM: 64 cores, 504 GB memory
In my TensorBoard plots, every time a worker is blacklisted, the `timesteps_total` value gets reset, and I get these zig-zag plots against the number of training steps. Here is the step-based TensorBoard plot for `num_healthy_workers`:

Plotting training against wall-clock time, however, gives me linear plots; you can see that training is restored, but the resets still hurt training performance. Do you happen to know what might cause `timesteps_total` to reset every time `num_healthy_workers` decreases?

Below is a list showing how `timesteps_total` gets reset between training iterations:
Training iteration: 1, Timesteps total: 12000
Training iteration: 2, Timesteps total: 24000
Training iteration: 3, Timesteps total: 36000
Training iteration: 4, Timesteps total: 48000
Training iteration: 5, Timesteps total: 60000
Training iteration: 6, Timesteps total: 72000
Training iteration: 7, Timesteps total: 84000
Training iteration: 8, Timesteps total: 96000
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
Training iteration: 17, Timesteps total: 34200
Training iteration: 18, Timesteps total: 45600
Training iteration: 19, Timesteps total: 57000
Training iteration: 20, Timesteps total: 68400
Training iteration: 21, Timesteps total: 11200
Training iteration: 22, Timesteps total: 22400
Training iteration: 23, Timesteps total: 33600
Training iteration: 24, Timesteps total: 44800
Training iteration: 25, Timesteps total: 56000
Training iteration: 26, Timesteps total: 67200
Training iteration: 27, Timesteps total: 78400
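For what it's worth, here is a minimal plain-Python sketch (with the iteration list above hard-coded, not using any Ray API) that flags every iteration where `timesteps_total` drops below its previous value:

```python
# Logged (training_iteration, timesteps_total) pairs from the run above.
pairs = [
    (1, 12000), (2, 24000), (3, 36000), (4, 48000), (5, 60000),
    (6, 72000), (7, 84000), (8, 96000), (9, 108000), (10, 120000),
    (11, 11600), (12, 23200), (13, 34800), (14, 46400),
    (15, 11400), (16, 22800), (17, 34200), (18, 45600), (19, 57000),
    (20, 68400),
    (21, 11200), (22, 22400), (23, 33600), (24, 44800), (25, 56000),
    (26, 67200), (27, 78400),
]

# An iteration counts as a "reset" when the cumulative counter decreases.
resets = [
    it for (_, prev_ts), (it, ts) in zip(pairs, pairs[1:])
    if ts < prev_ts
]
print(resets)  # -> [11, 15, 21]
```

Note that after each reset the per-iteration increment also shrinks (12000 → 11600 → 11400 → 11200), which seems consistent with one fewer worker contributing samples each time `num_healthy_workers` drops.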
Many thanks!