Restoring from checkpoint - New Graph is created - Performance drop?

Hi! I am restoring my TD3 agent from a checkpoint. When I resume training from that checkpoint, a new graph appears for the restored agent in TensorBoard.


Does restoring create a completely new experiment?
Is it possible to continue using the old graph / data after restoring, or to somehow merge the two again?

Furthermore, there seems to be a performance drop after restoring and continuing the training. However, maybe it is not really a performance drop: it could simply be that the running mean for the "new" second graph takes some time to climb back to the level of the previous mean (?)

I am using RLlib 1.2.
My code for restoring:

from ray.rllib.agents.ddpg import TD3Trainer

def my_train_fn(config, reporter):
    # Build the trainer and restore its state from an existing checkpoint.
    agent = TD3Trainer(config=config, env="guidance-continuous-v0")

    checkpoint_path = f'{checkpoint_dir}/checkpoints/checkpoint_6001/checkpoint-6001'
    agent.restore(checkpoint_path)

    # Continue training, saving a new checkpoint every 100 iterations.
    for i in range(5000):
        result = agent.train()
        if i % 100 == 0:
            checkpoint = agent.save(checkpoint_dir=f"{checkpoint_dir}/checkpoints")
            print("checkpoint saved at", checkpoint)
    agent.stop()

Thanks for your time and help.
Walt

Hey @Lauritowal, could you try this on the latest master?
We did some fixes recently (post 1.4) to properly include the exploration component’s state when saving/restoring. What I’m thinking is that after you train TD3 for some time, RLlib 1.2 does not properly save the timestep; hence, when restoring, it starts with timestep 0 again, leading to a large std for TD3’s Gaussian exploration.
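
To check whether that is what is happening, a rough sketch (untested, just the idea) would be to look at the policy’s global timestep right after restoring:

# Hypothetical check: inspect the timestep counter that the exploration
# schedule is based on, right after restoring from the checkpoint.
agent = TD3Trainer(config=config, env="guidance-continuous-v0")
agent.restore(checkpoint_path)

policy = agent.get_policy()
print("global_timestep after restore:", policy.global_timestep)
# If this prints 0 even though the checkpoint was taken after many steps,
# the Gaussian-noise std schedule effectively starts from scratch again.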

Thanks for the reply @sven1977

I tried version 2.0.0.dev0 on Colab, but now I get the following error when running the training:

== Status ==
Memory usage on this node: 1.8/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.45 GiB heap, 0.0/3.72 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /root/ray_results/experiment_full_circle_elevator_nosincos_date_29-06-2021_time_11-28-44_seed_4_NUM_EPISODES_TRAINING_15000_restored
Number of trials: 1/1 (1 ERROR)
Trial name	status	loc
my_train_fn_None_299a9_00000	ERROR	

Number of errored trials: 1
Trial name	# failures	error file
my_train_fn_None_299a9_00000	1	/root/ray_results/experiment_full_circle_elevator_nosincos_date_29-06-2021_time_11-28-44_seed_4_NUM_EPISODES_TRAINING_15000_restored/my_train_fn_None_299a9_00000_0_2021-06-29_11-28-44/error.txt

(pid=2819) WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
(pid=2819) Instructions for updating:
(pid=2819) non-resource variables are not supported in the long term
(pid=2819) WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow_probability/python/internal/variadic_reduce.py:115: calling function (from tensorflow.python.eager.def_function) with experimental_compile is deprecated and will be removed in a future version.
(pid=2819) Instructions for updating:
(pid=2819) experimental_compile is deprecated, use jit_compile instead
(pid=2819) 2021-06-29 11:28:48,822	WARNING util.py:53 -- Install gputil for GPU system monitoring.
(pid=2819) 2021-06-29 11:28:49,527	ERROR worker.py:406 -- SystemExit was raised from the worker
(pid=2819) Traceback (most recent call last):
(pid=2819)   File "python/ray/_raylet.pyx", line 491, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 444, in ray._raylet.execute_task.function_executor
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/function_manager.py", line 556, in actor_method_executor
(pid=2819)     return method(__ray_actor, *args, **kwargs)
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/actor.py", line 988, in __ray_terminate__
(pid=2819)     ray.actor.exit_actor()
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/actor.py", line 1064, in exit_actor
(pid=2819)     raise exit
(pid=2819) SystemExit: 0
(pid=2819) 
(pid=2819) During handling of the above exception, another exception occurred:
(pid=2819) 
(pid=2819) Traceback (most recent call last):
(pid=2819)   File "python/ray/_raylet.pyx", line 591, in ray._raylet.task_execution_handler
(pid=2819)   File "python/ray/_raylet.pyx", line 446, in ray._raylet.execute_task
(pid=2819)   File "python/ray/_raylet.pyx", line 484, in ray._raylet.execute_task
(pid=2819)   File "python/ray/includes/libcoreworker.pxi", line 33, in ray._raylet.ProfileEvent.__exit__
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 167, in format_exc
(pid=2819)     return "".join(format_exception(*sys.exc_info(), limit=limit, chain=chain))
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 121, in format_exception
(pid=2819)     type(value), value, tb, limit=limit).format(chain=chain))
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 508, in __init__
(pid=2819)     capture_locals=capture_locals)
(pid=2819)   File "/usr/lib/python3.7/traceback.py", line 359, in extract
(pid=2819)     linecache.checkcache(filename)
(pid=2819)   File "/usr/lib/python3.7/linecache.py", line 74, in checkcache
(pid=2819)     stat = os.stat(fullname)
(pid=2819)   File "/usr/local/lib/python3.7/dist-packages/ray/worker.py", line 403, in sigterm_handler
(pid=2819)     sys.exit(1)
(pid=2819) SystemExit: 1
---------------------------------------------------------------------------
TuneError                                 Traceback (most recent call last)
<ipython-input-20-5ba34e9666d1> in <module>()
      6              restore=checkpoint_path,
      7              name=experiment_name,
----> 8              config=config
      9 )
     10 

/usr/local/lib/python3.7/dist-packages/ray/tune/tune.py in run(run_or_experiment, name, metric, mode, stop, time_budget_s, config, resources_per_trial, num_samples, local_dir, search_alg, scheduler, keep_checkpoints_num, checkpoint_score_attr, checkpoint_freq, checkpoint_at_end, verbose, progress_reporter, log_to_file, trial_name_creator, trial_dirname_creator, sync_config, export_formats, max_failures, fail_fast, restore, server_port, resume, queue_trials, reuse_actors, trial_executor, raise_on_failed_trial, callbacks, loggers, ray_auto_init, run_errored_only, global_checkpoint_period, with_server, upload_dir, sync_to_cloud, sync_to_driver, sync_on_checkpoint, _remote)
    541     if incomplete_trials:
    542         if raise_on_failed_trial and not state[signal.SIGINT]:
--> 543             raise TuneError("Trials did not complete", incomplete_trials)
    544         else:
    545             logger.error("Trials did not complete: %s", incomplete_trials)

TuneError: ('Trials did not complete', [my_train_fn_None_299a9_00000])

My code:

analysis = tune.run(my_train_fn,
                    restore=checkpoint_path,
                    name=experiment_name,
                    config=config)

Strange, could you send the entire stack trace? The actual information (on the RLlib/Tune error) is further above, I think, @Lauritowal. Thanks :slight_smile:

Hi Sven.
Thank you very much for your time and patience!

I’ve tried it again today. I’ve installed the latest ray[rllib] wheel on Google Colab:

!pip install "ray[rllib]@https://s3-us-west-2.amazonaws.com/ray-wheels/master/ba6cebe30fab6925e5b2d9e859ad064d53015246/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"

However, when I start the training, all I get is the same PENDING message over and over again:

2021-07-12 07:41:07,389	WARNING tune.py:494 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Memory usage on this node: 2.5/12.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/1 GPUs, 0.0/7.47 GiB heap, 0.0/3.73 GiB objects (0.0/1.0 accelerator_type:P100)
Result logdir: /root/ray_results/experiment_wind_3_date_12-07-2021_time_07-41-07_seed_4
Number of trials: 1/1 (1 PENDING)
Trial name	status	loc
my_train_fn_None_84d45_00000	PENDING

....

It seems like the resources are not being set; see the warning at the beginning of the output above:

2021-07-12 07:41:07,389 WARNING tune.py:494 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override Trainable.default_resource_request if using the Trainable API.

I got the resources via:
resources = TD3Trainer.default_resource_request(config)

And started the training with them like this:

tune.run(my_train_fn,
         restore=checkpoint_path,
         checkpoint_at_end=True,
         reuse_actors=True,
         name="example_name",
         resources_per_trial=resources,
         config=config)
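
Just for comparison, the dict form that the warning suggests would look like this (only a sketch on my side, I have not verified that it changes anything):

tune.run(my_train_fn,
         restore=checkpoint_path,
         checkpoint_at_end=True,
         reuse_actors=True,
         name="example_name",
         # Request 1 CPU and 1 GPU per trial via a plain dict,
         # as the Tune warning above suggests.
         resources_per_trial={"cpu": 1, "gpu": 1},
         config=config)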

Any idea what is happening there? :confused:

Hey @Lauritowal , yeah, seems like it’s waiting for resources to be available. I see you only have 2 CPUs available. How many workers are you using (num_workers=?)?
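
If it helps with debugging, you could also print what the cluster reports as total vs. currently free resources (just a quick sanity check, nothing RLlib-specific):

import ray

# Quick sanity check of what Ray sees on the node.
ray.init(ignore_reinit_error=True)
print("total:", ray.cluster_resources())
print("free: ", ray.available_resources())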

Hi @sven1977,

"num_gpus": 1,
"num_workers": 1,
"num_envs_per_worker": 3,

This is the complete config:

custom_config = {
        "lr": 0.0001, # tune.grid_search([0.01, 0.001, 0.0001]),
        "framework": "torch",
        "callbacks": CustomCallbacks,
        "log_level": "WARN",
        "evaluation_interval": 20,
        "evaluation_num_episodes": 10,
        "num_gpus": 1,
        "num_workers": 1,
        "num_envs_per_worker": 3,
        "seed": SEED,
        "evaluation_config": {
            "explore": False
        },
        "evaluation_num_workers": 0,
        "env_config": {
            "jsbsim_path": JSBSIM_PATH_DRIVE,
            "flightgear_path": "",
            "aircraft": cessna172P,
            "agent_interaction_freq": 5,
            "target_radius": 100 / 1000,
            "max_distance_km": 4,
            "max_target_distance_km": 2, 
            "max_episode_time_s": 60 * 5,
            "phase": 0,
            "render_progress_image": False,
            "render_progress_image_path": './data',
            "offset": 0,
            "evaluation": False,
            "seed": SEED,
        }
}

The thing is that it does work with RLlib v1.2.

For RLlib v1.2 I would instead use resources.to_json() and pass it like this:

tune.run(my_train_fn,
         restore=checkpoint_path,
         checkpoint_at_end=True,
         reuse_actors=True,
         name="example_name",
         resources_per_trial=resources.to_json(),
         config=config)

Whereas 2.0.0.dev0 does not work…
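
As a workaround I was thinking of something like the following (only a sketch, I have not verified it on 2.0.0.dev0), so that .to_json() is only called when the returned resources object actually has it:

resources = TD3Trainer.default_resource_request(config)
# Older versions return an object with .to_json(); otherwise pass the
# returned resources object through as-is.
if hasattr(resources, "to_json"):
    resources = resources.to_json()

tune.run(my_train_fn,
         restore=checkpoint_path,
         checkpoint_at_end=True,
         reuse_actors=True,
         name="example_name",
         resources_per_trial=resources,
         config=config)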

Thanks for your help!