How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am trying to reload an Evolution Strategies (ES) agent from a checkpoint file during rollout/evaluation. I hit this error when evaluating an ES agent that had been trained for 200+ iterations. To reproduce it in a consistent environment, I quickly trained another agent for a single iteration with the same parameters and tried to reload it from its first checkpoint; it failed with the same error:
Traceback (most recent call last):
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 138, in <module>
    rollout_agent(config, args.checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 92, in rollout_agent
    agent = load_checkpoint(exp_config, checkpoint)
  File "/work/slzhou/telescope-scheduler-v2/evaluate/rollout_agent.py", line 66, in load_checkpoint
    agent.restore(agent_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/tune/trainable.py", line 467, in restore
    self.load_checkpoint(checkpoint_path)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/trainer.py", line 1823, in load_checkpoint
    self.__setstate__(extra_data)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es.py", line 411, in __setstate__
    self.policy.set_flat_weights(state["weights"])
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/rllib/agents/es/es_tf_policy.py", line 182, in set_flat_weights
    self.variables.set_flat(x)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 143, in set_flat
    arrays = unflatten(new_weights, shapes)
  File "/home/slzhou/anaconda3/lib/python3.9/site-packages/ray/experimental/tf_utils.py", line 18, in unflatten
    assert len(vector) == i, "Passed weight does not have the correct shape."
AssertionError: Passed weight does not have the correct shape.
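For context, the assertion comes from Ray's unflatten helper, which slices a single flat weight vector back into one array per policy variable and then checks that the total element count was consumed exactly. A simplified re-implementation (a sketch, not Ray's exact code) shows why any mismatch between the checkpointed vector's length and the fresh policy's parameter count trips it:

```python
import numpy as np

def unflatten(vector, shapes):
    """Split a flat 1-D vector into arrays of the given shapes.

    Simplified sketch of ray.experimental.tf_utils.unflatten:
    walks the shapes in order, carving out np.prod(shape) elements
    for each, then asserts the vector was consumed exactly.
    """
    i = 0
    arrays = []
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(vector[i:i + size].reshape(shape))
        i += size
    # If the checkpointed vector is longer or shorter than the sum of
    # all variable sizes, this is the assertion from the traceback.
    assert len(vector) == i, "Passed weight does not have the correct shape."
    return arrays
```

So the error means the flat weight vector stored in the checkpoint and the variables of the freshly built policy disagree on the total number of parameters.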
Here is our agent reload code:
from ray.rllib.agents import es

# Build a fresh trainer with the same config and environment used in training
agent_config = es.DEFAULT_CONFIG.copy()
agent_config["log_level"] = "WARN"
agent_config["explore"] = False
agent = es.ESTrainer(agent_config, env='environment_name')

# Restore the trained weights from the saved checkpoint
agent_path = "checkpoint_path"
agent.restore(agent_path)
Here is our training code:
import tqdm

# Train for n_iter iterations, saving a checkpoint after each one
for step in tqdm.trange(n_iter):
    training_result = agent.train()
    agent.save(checkpoint_dir)
We are running all of this on a Red Hat Linux server. CPU model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz.
To my understanding, this is what is happening: we train an agent with a set of configs in an environment; then, during rollout, we build an agent with the same configs and the same environment, yet the flat weight vector restored from the checkpoint does not match the parameter count of the freshly built agent. I'm not sure how this is happening, or what we may be doing wrong, so any pointers would be helpful.
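One way to narrow this down might be to compare the two sizes directly: the length of the flat weight array in the checkpoint versus the total parameter count the rebuilt policy expects. The sketch below is hypothetical: it assumes (for illustration only) a payload that is a pickled dict with a flat "weights" array; the real Ray checkpoint wraps the ES state in extra metadata, so the loading step would need to be adapted. The helper names `expected_param_count` and `check_checkpoint_size` are made up for this example.

```python
import pickle

import numpy as np

def expected_param_count(shapes):
    """Total number of scalars across all variable shapes."""
    return int(sum(np.prod(s) for s in shapes))

def check_checkpoint_size(payload_file, shapes):
    """Compare the flat weight vector stored in a (hypothetical) pickled
    payload against the parameter count a freshly built policy expects.

    Returns (stored_size, expected_size); the restore only succeeds
    when the two are equal.
    """
    with open(payload_file, "rb") as f:
        state = pickle.load(f)  # assumption: dict with a flat "weights" array
    stored = len(state["weights"])
    expected = expected_param_count(shapes)
    return stored, expected
```

If the two numbers differ, the next question is which side changed: e.g. a config key left at its default in one script but set in the other (model size, observation/action space preprocessing) would silently change the variable shapes of the rebuilt policy.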