How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to reload an Evolution Strategies (ES) agent from a checkpoint file during rollout/evaluation. I hit this error when evaluating an ES agent that had been trained for 200+ iterations. To reproduce it in a consistent environment, I quickly trained another agent for a single iteration with the same parameters and tried to reload it from its first checkpoint; it failed with the same error:
Traceback (most recent call last):
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 138, in <module>
    rollout_agent(config, args.checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 92, in rollout_agent
    agent = load_checkpoint(exp_config, checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 66, in load_checkpoint
    agent.restore(agent_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/tune/trainable.py", line 467, in restore
    self.load_checkpoint(checkpoint_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1823, in load_checkpoint
    self.__setstate__(extra_data)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es.py", line 411, in __setstate__
    self.policy.set_flat_weights(state["weights"])
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es_tf_policy.py", line 182, in set_flat_weights
    self.variables.set_flat(x)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 143, in set_flat
    arrays = unflatten(new_weights, shapes)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 18, in unflatten
    assert len(vector) == i, "Passed weight does not have the correct shape."
AssertionError: Passed weight does not have the correct shape.
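For context, the assertion comes from Ray's unflatten helper, which slices a single flat weight vector back into one array per policy variable and then checks that the total element count was consumed exactly. A simplified re-implementation (a sketch, not Ray's exact code) shows why any mismatch between the checkpointed vector's length and the fresh policy's parameter count trips it:

```python
import numpy as np

def unflatten(vector, shapes):
    """Split a flat 1-D vector into arrays of the given shapes.

    Simplified sketch of ray.experimental.tf_utils.unflatten:
    walks the shapes in order, carving out np.prod(shape) elements
    for each, then asserts the vector was consumed exactly.
    """
    i = 0
    arrays = []
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(vector[i:i + size].reshape(shape))
        i += size
    # If the checkpointed vector is longer or shorter than the sum of
    # all variable sizes, this is the assertion from the traceback.
    assert len(vector) == i, "Passed weight does not have the correct shape."
    return arrays
```

So the error means the flat weight vector stored in the checkpoint and the variables of the freshly built policy disagree on the total number of parameters.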
Here is our agent reload code:
from ray.rllib.agents import es

# Build a fresh trainer with the same config and environment used in training
agent_config = es.DEFAULT_CONFIG.copy()
agent_config["log_level"] = "WARN"
agent_config["explore"] = False
agent = es.ESTrainer(agent_config, env='environment_name')

# Restore the trained weights from the saved checkpoint
agent_path = "checkpoint_path"
agent.restore(agent_path)
Here is our training code:
import tqdm

# Train for n_iter iterations, saving a checkpoint after each one
for step in tqdm.trange(n_iter):
    training_result = agent.train()
    agent.save(checkpoint_dir)
We are running all of this on a Red Hat Linux server. CPU model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz.
To my understanding, this is what is happening: we train an agent with a set of configs in an environment; then, during rollout, we build an agent with the same configs and the same environment, yet the flat weight vector restored from the checkpoint does not match the parameter count of the freshly built agent. I'm not sure how this is happening, or what we may be doing wrong, so any pointers would be helpful.
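One way to narrow this down might be to compare the two sizes directly: the length of the flat weight array in the checkpoint versus the total parameter count the rebuilt policy expects. The sketch below is hypothetical: it assumes (for illustration only) a payload that is a pickled dict with a flat "weights" array; the real Ray checkpoint wraps the ES state in extra metadata, so the loading step would need to be adapted. The helper names `expected_param_count` and `check_checkpoint_size` are made up for this example.

```python
import pickle

import numpy as np

def expected_param_count(shapes):
    """Total number of scalars across all variable shapes."""
    return int(sum(np.prod(s) for s in shapes))

def check_checkpoint_size(payload_file, shapes):
    """Compare the flat weight vector stored in a (hypothetical) pickled
    payload against the parameter count a freshly built policy expects.

    Returns (stored_size, expected_size); the restore only succeeds
    when the two are equal.
    """
    with open(payload_file, "rb") as f:
        state = pickle.load(f)  # assumption: dict with a flat "weights" array
    stored = len(state["weights"])
    expected = expected_param_count(shapes)
    return stored, expected
```

If the two numbers differ, the next question is which side changed: e.g. a config key left at its default in one script but set in the other (model size, observation/action space preprocessing) would silently change the variable shapes of the rebuilt policy.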