I am restoring a checkpoint from a Tune experiment trial and am attempting to manually compute a single action from a policy given an observation.
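For context, algo was restored roughly like this (a minimal sketch; the checkpoint path is illustrative, not my real one):

from ray.rllib.algorithms.algorithm import Algorithm

# Restore the trained PPO Algorithm from a Tune trial checkpoint
# (path is illustrative).
algo = Algorithm.from_checkpoint("~/ray_results/my_experiment/trial_0/checkpoint_000010")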
Input:

import numpy as np

obs = algo.get_policy("learned").observation_space.sample()  # guaranteed to be the right dimensions
lstm_cell_size = algo.config.model["lstm_cell_size"]
state = [np.zeros([lstm_cell_size], np.float32) for _ in range(2)]  # create an empty (zeroed) state
algo.compute_single_action(obs, state, policy_id="learned", explore=False)
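As an aside, I know the Policy API can also report an initial recurrent state directly, which I would expect to match whatever shape compute_single_action wants (a sketch; I am assuming Policy.get_initial_state behaves the same on a restored algorithm):

# Alternative way to build the state: ask the policy for its own
# initial RNN state (a list of zeroed arrays) instead of hand-building it.
state = algo.get_policy("learned").get_initial_state()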
algo is just a restored PPO instance. It appears that I must pass a state to compute_single_action, since I have use_lstm set to True, but I get the following error (truncated):
File ~/miniconda3/envs/test/lib/python3.10/site-packages/ray/rllib/core/models/base.py:417, in StatefulActorCriticEncoder._forward(self, inputs, **kwargs)
415 actor_inputs = inputs.copy()
416 critic_inputs = inputs.copy()
--> 417 actor_inputs[STATE_IN] = inputs[STATE_IN][ACTOR]
418 critic_inputs[STATE_IN] = inputs[STATE_IN][CRITIC]
420 actor_out = self.actor_encoder(actor_inputs, **kwargs)
IndexError: too many indices for tensor of dimension 3
What gives? I’m not sure how to properly pass in the state (or even the dimensions of it, apparently).
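From the failing frame it looks as if the encoder indexes STATE_IN with separate ACTOR and CRITIC keys, so perhaps it expects a nested mapping rather than the flat list I am passing. A guess at the expected structure, inferred purely from the traceback and not verified:

# Keys inferred from inputs[STATE_IN][ACTOR] / [CRITIC] in the traceback;
# the inner contents (shapes, h/c layout) are unknown to me.
state = {
    "actor": ...,   # recurrent state for the actor encoder
    "critic": ...,  # recurrent state for the critic encoder
}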
My config for evaluation is identical to the config used for training, except that explore is set to False in the evaluation config:
import gymnasium as gym  # older Ray versions use `import gym`
from ray.rllib.algorithms.algorithm_config import AlgorithmConfig
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.policy.policy import PolicySpec

# RandomAction (a custom Policy) and select_policy are defined elsewhere in my code.

config = (  # 1. Configure the algorithm.
    PPOConfig()
    .environment("basic_multi_agent")
    .rollouts(num_rollout_workers=2)
    .framework("torch")
    .training(model={"fcnet_hiddens": [64, 64]})
    .evaluation(evaluation_num_workers=2, evaluation_config={"explore": False})
    .multi_agent(
        policies={
            "random_action": PolicySpec(
                policy_class=RandomAction,
                observation_space=gym.spaces.Box(-1e18, 1e18, (4,)),
                action_space=gym.spaces.Discrete(3),
            ),
            "learned": PolicySpec(
                config=AlgorithmConfig.overrides(
                    model={"use_lstm": True},
                    framework_str="torch",
                ),
                observation_space=gym.spaces.Box(-1e18, 1e18, (4,)),
                action_space=gym.spaces.Discrete(3),
            ),
        },
        policy_mapping_fn=select_policy,
        policies_to_train=["learned"],
    )
)
config.sgd_minibatch_size = 128
config.train_batch_size = int(256 * 4)
config.env_config = {"num_agents": 50, "episode_length": 1000}
To be clear, these checkpoints were created by trials launched from a call to tune.Tuner.fit(), roughly as sketched below. I have read other material covering similar problems and tried to follow it, to no avail.
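The training run looked approximately like this (a sketch; the stopping criterion and checkpoint frequency are illustrative, not my exact settings):

from ray import air, tune

tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        stop={"training_iteration": 100},  # illustrative stop criterion
        checkpoint_config=air.CheckpointConfig(checkpoint_frequency=10),
    ),
)
results = tuner.fit()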