TensorBoard plot folds back on itself (restarts at 0)

Hi,

When I run RLlib with Tune, the TensorBoard plots sometimes wrap around. Maybe the run crashed and automatically restarted? But it wraps right at 1,000,000 steps, so maybe there is a parameter somewhere that controls this?

See image.

[Image: TensorBoard chart in which the curve wraps back to step 0]

From the chart it seems we wrap around before hitting 900k steps?
How are you running this? On a local computer, on a local cluster, or on a cluster in the cloud?
Did the job crash and get restarted? Can you see anything from the logs?

Maybe the training progress was restored, but the step is reset to 0.
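
One way to check for a reset is to look for the point where the logged step counter goes down. The snippet below is only a rough sketch: it assumes the trial directory contains a `progress.csv` with a `timesteps_total` column, which may not match your Ray version, and the helper names `find_step_resets` and `load_steps` are made up for illustration.

```python
import csv
from typing import List, Tuple


def find_step_resets(steps: List[int]) -> List[Tuple[int, int, int]]:
    """Return (row_index, previous_step, new_step) for each row where the step counter decreases."""
    resets = []
    for i in range(1, len(steps)):
        if steps[i] < steps[i - 1]:
            resets.append((i, steps[i - 1], steps[i]))
    return resets


def load_steps(progress_csv: str, column: str = "timesteps_total") -> List[int]:
    """Read one numeric column from a CSV file that has a header row.

    The default column name is an assumption about the log layout.
    """
    with open(progress_csv, newline="") as f:
        return [int(float(row[column])) for row in csv.DictReader(f) if row.get(column)]


if __name__ == "__main__":
    # Synthetic values for illustration only, not taken from a real run.
    example = [200_000, 400_000, 600_000, 870_000, 4_000, 8_000]
    print(find_step_resets(example))
```

If this reports a drop at a row whose timestamp matches a crash in the logs, that would support the idea that the counter was reset on restart rather than capped by a parameter.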

Oh, yeah, you’re right, it does look more like 870k, so it is probably starting over rather than hitting some limit.

I’m running this on a server with 20 CPUs.

Looking at the logs, it looks like the environment crashed, and I have "ignore_worker_failures": True in my config.

So it must be crashing, and when it starts again it doesn’t keep track of the global timestep. Is there some other configuration option I should set?