Hi!
Ray: v1.0.1
TensorFlow: 2.0
Python: 3.6.9
Ubuntu: 16.04
Head node VM: 64 cores, 504 GB memory
In my TensorBoard plots, every time a worker is blacklisted, the timesteps_total value gets reset, so I get these zigzag curves when plotting against the number of training steps. Here is the step-based TensorBoard plot for num_healthy_workers:
[Plot: num_healthy_workers vs. training steps]
Plotting the training metrics against wall-clock time, however, gives me linear plots. Do you happen to know what might cause timesteps_total to reset every time num_healthy_workers decreases?
Below is a list showing how timesteps_total resets between training iterations:
Training iteration: 1, Timesteps total: 12000
Training iteration: 2, Timesteps total: 24000
Training iteration: 3, Timesteps total: 36000
Training iteration: 4, Timesteps total: 48000
Training iteration: 5, Timesteps total: 60000
Training iteration: 6, Timesteps total: 72000
Training iteration: 7, Timesteps total: 84000
Training iteration: 8, Timesteps total: 96000
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
Training iteration: 17, Timesteps total: 34200
Training iteration: 18, Timesteps total: 45600
Training iteration: 19, Timesteps total: 57000
Training iteration: 20, Timesteps total: 68400
Training iteration: 21, Timesteps total: 11200
Training iteration: 22, Timesteps total: 22400
Training iteration: 23, Timesteps total: 33600
Training iteration: 24, Timesteps total: 44800
Training iteration: 25, Timesteps total: 56000
Training iteration: 26, Timesteps total: 67200
Training iteration: 27, Timesteps total: 78400
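As a stopgap for plotting (not a fix, and not an RLlib or Tune feature), the logged totals can be stitched back into a monotonic counter offline. This is a minimal sketch that assumes each drop means the counter restarted from zero. The values above are consistent with that: the first value after each drop equals one iteration's worth of steps. The names LOG, parse_log and make_monotonic are hypothetical helpers for illustration only.

```python
import re

# Log lines copied from the list above.
LOG = """\
Training iteration: 1, Timesteps total: 12000
Training iteration: 2, Timesteps total: 24000
Training iteration: 3, Timesteps total: 36000
Training iteration: 4, Timesteps total: 48000
Training iteration: 5, Timesteps total: 60000
Training iteration: 6, Timesteps total: 72000
Training iteration: 7, Timesteps total: 84000
Training iteration: 8, Timesteps total: 96000
Training iteration: 9, Timesteps total: 108000
Training iteration: 10, Timesteps total: 120000
Training iteration: 11, Timesteps total: 11600
Training iteration: 12, Timesteps total: 23200
Training iteration: 13, Timesteps total: 34800
Training iteration: 14, Timesteps total: 46400
Training iteration: 15, Timesteps total: 11400
Training iteration: 16, Timesteps total: 22800
Training iteration: 17, Timesteps total: 34200
Training iteration: 18, Timesteps total: 45600
Training iteration: 19, Timesteps total: 57000
Training iteration: 20, Timesteps total: 68400
Training iteration: 21, Timesteps total: 11200
Training iteration: 22, Timesteps total: 22400
Training iteration: 23, Timesteps total: 33600
Training iteration: 24, Timesteps total: 44800
Training iteration: 25, Timesteps total: 56000
Training iteration: 26, Timesteps total: 67200
Training iteration: 27, Timesteps total: 78400
"""

LINE_RE = re.compile(r"Training iteration: (\d+), Timesteps total: (\d+)")


def parse_log(text):
    """Return (iteration, timesteps_total) pairs found in the log text."""
    return [(int(i), int(t)) for i, t in LINE_RE.findall(text)]


def make_monotonic(totals):
    """Turn a counter that restarts from zero into a cumulative one.

    Assumes every drop means the counter restarted at 0, so the value
    reported after a drop is the number of steps since that restart.
    Returns the rebuilt totals and the indices where a restart was seen.
    """
    offset = 0  # steps accumulated before the latest restart
    prev = 0    # previously reported (raw) total
    fixed, resets = [], []
    for idx, total in enumerate(totals):
        if total < prev:
            offset += prev
            resets.append(idx)
        fixed.append(offset + total)
        prev = total
    return fixed, resets


if __name__ == "__main__":
    rows = parse_log(LOG)
    iterations = [i for i, _ in rows]
    fixed, resets = make_monotonic([t for _, t in rows])
    print("restarts at iterations:", [iterations[k] for k in resets])
    # Steps added per iteration, derived from the rebuilt cumulative counter.
    deltas = [b - a for a, b in zip([0] + fixed, fixed)]
    for it, total, step in zip(iterations, fixed, deltas):
        print(f"iter {it:2d}: cumulative {total:6d} (+{step})")
```

On the values above this reports restarts at iterations 11, 15 and 21 and a final cumulative total of 313200. It also shows that the per-iteration increment shrinks after each restart (12000, then 11600, 11400 and 11200).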
Many thanks!