PPO only runs several steps in one episode

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task.
    High

Hi all,

I am training PPO on my custom environment. The code runs, but the process stops after 3 or 4 steps within one episode (my environment has time steps). I would appreciate it a lot if someone could help me with the code. My config code is below.

# Note: `env_config` and `CustomCallbacks` are defined elsewhere in my code,
# and "my_env" is registered as a custom environment.
from ray.rllib.algorithms.ppo import PPOConfig

def train_ppo():
    config = (
        PPOConfig()
        .training(
            train_batch_size_per_learner=64,
            mini_batch_size_per_learner=64,
            lambda_=0.95,
            kl_coeff=0.5,
            clip_param=0.1,
            vf_clip_param=10.0,
            entropy_coeff=0.01,
            num_sgd_iter=10,
            lr=0.00015,
            grad_clip=100.0,
            grad_clip_by="global_norm",
        )
        .environment("my_env", env_config=env_config)
        .rollouts(
            num_rollout_workers=1,
            rollout_fragment_length=50,
        )
        .rl_module(
            model_config_dict={
                "fcnet_hiddens": [512, 512],
                "fcnet_activation": "tanh"
            }
        )
        .resources(
            num_gpus=0,
        )
        .callbacks(CustomCallbacks)
        .debugging(log_level="DEBUG")
    )
    algo = config.build()
    result = algo.train()

train_ppo()

Another problem is that I don’t know how to read out the reward I got: there is no reward in `result`, even though I can confirm that the `step` function of my environment returns a reward.

Thank you!!

BR

Hi @ZiyanLydia,

Your issue is this line here:

result = algo.train()

That will do exactly one iteration of rollouts followed by optimization.

To run more than one iteration, either put that call in a loop or, preferably, specify a stopping condition and use the Tune API.
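A minimal sketch of both options, reusing the `config` object from your snippet. Note this is an assumption-laden sketch: the exact result keys (e.g. `episode_reward_mean`) and the import location of `RunConfig` have changed between RLlib/Ray versions, so check the docs for the version you are on.

```python
from ray import train, tune

# Option 1: call train() yourself in a loop.
algo = config.build()
for i in range(100):  # 100 training iterations
    result = algo.train()
    # On the old API stack the mean episode reward is reported under
    # this key; newer versions nest metrics differently.
    print(i, result.get("episode_reward_mean"))

# Option 2 (preferred): let Tune drive the loop with a stop condition.
tuner = tune.Tuner(
    "PPO",
    param_space=config.to_dict(),
    run_config=train.RunConfig(
        stop={"training_iteration": 100},  # stop after 100 iterations
    ),
)
results = tuner.fit()
```

With Tune, the reward metrics are also logged to the results directory automatically, which should answer the second question about where the reward goes.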