Tensorboard folds back on itself (restarts at 0)

jmugan · October 1, 2021, 3:44pm

Hi,

When I run RLlib with Tune, the TensorBoard plots sometimes wrap around. Maybe the run crashed and automatically restarted? But it wraps right at 1,000,000, so maybe there is a parameter somewhere?

See image.

gjoliver · October 1, 2021, 5:27pm

From the chart, it seems we wrap around before hitting 900k steps?
How are you running this? On a local computer, on a local cluster, or on a cluster in the cloud?
Did the job crash and get restarted? Can you see anything in the logs?

Maybe the training progress was restored, but the step is reset to 0.
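
One way to check for this is to look for places where the logged step goes down instead of up. The snippet below is a minimal sketch in plain Python, not part of RLlib or TensorBoard; the function name `find_step_resets` and the sample step values are made up for illustration and assume you have already exported the step column of a scalar from your logs.

```python
def find_step_resets(steps):
    """Return the indices where a logged step is lower than the one before it.

    `steps` is assumed to be the step values of one scalar, in the order
    they were written to the log.
    """
    return [i for i in range(1, len(steps)) if steps[i] < steps[i - 1]]


# Hypothetical step sequence: the counter climbs to ~870k and then restarts at 0.
steps = [0, 4000, 8000, 868000, 872000, 0, 4000, 8000]
print(find_step_resets(steps))  # → [5]
```

If the returned list is non-empty, the plot is folding back because the step counter restarted, not because of a cap at some fixed value.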

jmugan · October 1, 2021, 5:46pm

Oh, yeah, you’re right, it does look more like 870k, so probably some sort of starting over.

I’m running this on a server with 20 CPUs.

Looking at the logs, it looks like the environment crashed, and I have the config "ignore_worker_failures": True

So it must be crashing, and when it starts again it doesn’t keep track of the global timestep. Is there maybe some other configuration I should set?
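
In the meantime, the logged steps can be stitched back into one increasing series after the fact. This is only a post-processing sketch in plain Python, not an RLlib setting; `rebase_steps` is a hypothetical helper, and it assumes that each restart begins counting from 0 and should continue from the last step seen before the reset. Steps taken between the last log entry and the crash are not recoverable this way.

```python
def rebase_steps(steps):
    """Shift steps after each reset so the series never decreases.

    Assumption: whenever a step is lower than the previous one, the counter
    restarted, and the new segment continues from the last step logged
    before the restart.
    """
    offset = 0
    prev = None
    rebased = []
    for step in steps:
        if prev is not None and step < prev:
            # Counter restarted: carry forward the last step of the old segment.
            offset += prev
        rebased.append(step + offset)
        prev = step
    return rebased


# Hypothetical logged steps with one restart after 872000.
print(rebase_steps([0, 4000, 872000, 0, 4000]))
# → [0, 4000, 872000, 872000, 876000]
```

The rebased values could then be plotted or re-logged in place of the original step column.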
