Reload agent: "Passed weight does not have the correct shape."

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am trying to reload an Evolution Strategies (ES) agent from a checkpoint file during rollout/evaluation. I first hit this error when evaluating an ES agent that had been trained for 200+ iterations. To reproduce it in a controlled setting, I trained another agent for a single iteration with the same parameters and tried to reload it from its first checkpoint. This produced the same error:

Traceback (most recent call last):
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 138, in <module>
    rollout_agent(config, args.checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 92, in rollout_agent
    agent = load_checkpoint(exp_config, checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 66, in load_checkpoint
    agent.restore(agent_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/tune/trainable.py", line 467, in restore
    self.load_checkpoint(checkpoint_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1823, in load_checkpoint
    self.__setstate__(extra_data)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es.py", line 411, in __setstate__
    self.policy.set_flat_weights(state["weights"])
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es_tf_policy.py", line 182, in set_flat_weights
    self.variables.set_flat(x)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 143, in set_flat
    arrays = unflatten(new_weights, shapes)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 18, in unflatten
    assert len(vector) == i, "Passed weight does not have the correct shape."
AssertionError: Passed weight does not have the correct shape.
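
As far as I can tell from the traceback, the assertion in `unflatten` compares the length of the flat weight vector with the total number of elements implied by the policy's variable shapes. Below is a minimal NumPy sketch of that kind of check. It is not the actual Ray source; the function name and the layer shapes are made up for illustration:

```python
import numpy as np


def unflatten_sketch(vector, shapes):
    """Split a flat vector into arrays of the given shapes (illustrative only)."""
    i = 0
    arrays = []
    for shape in shapes:
        size = int(np.prod(shape, dtype=np.int64))
        arrays.append(vector[i:i + size].reshape(shape))
        i += size
    # Same condition as the failing assertion: every element must be consumed.
    assert len(vector) == i, "Passed weight does not have the correct shape."
    return arrays


# Hypothetical shapes: one 4x8 weight matrix plus an 8-element bias.
shapes = [(4, 8), (8,)]
ok = unflatten_sketch(np.zeros(40), shapes)  # 4*8 + 8 = 40 elements, passes

try:
    unflatten_sketch(np.zeros(48), shapes)  # 8 leftover values, fails
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

In this sketch, a vector that is too short would already fail inside `reshape` with a `ValueError`, so the assertion is only reached when the vector is too long. If Ray's version behaves the same way, the checkpoint would hold more weights than the freshly built policy expects.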

Here is our agent reload code:

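from ray.rllib.agents import es

# Start from the default ES config and override only what we need.
agent_config = es.DEFAULT_CONFIG.copy()
agent_config["log_level"] = "WARN"
agent_config["explore"] = False

# Build a fresh trainer on the same environment used for training.
agent = es.ESTrainer(agent_config, env="environment_name")
agent_path = "checkpoint_path"  # placeholder for the real checkpoint file

# This is the call that raises the AssertionError.
agent.restore(agent_path)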

Here is our training code:

import tqdm

for step in tqdm.trange(n_iter):
    # Run one training iteration, then write a checkpoint.
    training_result = agent.train()
    agent.save(checkpoint_dir)

We are running all of this on a Red Hat Linux server. CPU model: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz

My understanding of what is happening: we train an agent with a given config in an environment. Then, during rollout, we build a new agent with the same config and the same environment, but the parameter vector of this new agent does not have the same size as that of the trained one. I’m not sure how this happens or what we may be doing wrong, so any pointers would be helpful.
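
One way to test this would be to compare the number of weights stored in the checkpoint with the number the freshly built policy expects. The sketch below assumes the ES checkpoint file is a plain pickle of the state dict that `__setstate__` receives (the traceback shows it reading `state["weights"]`). I have not confirmed this against the Ray source, and `saved_weight_count` is a hypothetical helper:

```python
import pickle


def saved_weight_count(checkpoint_file):
    """Return the length of the flat weight vector stored in a checkpoint.

    Assumes the file is a pickled dict with a "weights" entry.
    """
    with open(checkpoint_file, "rb") as f:
        state = pickle.load(f)
    return len(state["weights"])
```

Comparing `saved_weight_count(agent_path)` with `len(agent.policy.get_flat_weights())` on the newly constructed trainer should show which side is larger. This assumes the policy exposes a `get_flat_weights` method alongside the `set_flat_weights` seen in the traceback.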