Use Policy_Trainer with TensorBoard

You can provide a name, which I think should affect the name of the log directory.

That would be very nice. Is this as simple as adding a few lines to make ?, name="run_name")

    print(f"Finished train run #{i + 1}")
    i += 1
    if i % 2 == 0:
        checkpoint =
        print("Last checkpoint", checkpoint)

That’s my current loop. Do I change print(pretty_print(trainer.train())) to, name=“run_name”)?
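For reference, the manual loop above can be sketched in a self-contained form. This assumes the elided right-hand side of `checkpoint =` is a call to the trainer's `save()` method. `FakeTrainer` and `run_training` are hypothetical names, and `FakeTrainer` is only a stand-in so the snippet runs without Ray installed; it mimics the `train()` and `save()` calls the loop makes on an RLlib trainer.

```python
class FakeTrainer:
    """Hypothetical stand-in for an RLlib trainer (not part of Ray)."""

    def __init__(self):
        self.iteration = 0

    def train(self):
        # A real trainer runs one training iteration and returns metrics.
        self.iteration += 1
        return {"training_iteration": self.iteration}

    def save(self):
        # A real trainer writes a checkpoint to disk and returns its path.
        return f"checkpoint_{self.iteration:06d}"


def run_training(trainer, num_iterations, checkpoint_every=2):
    """Train for num_iterations, saving every checkpoint_every iterations."""
    checkpoints = []
    for i in range(1, num_iterations + 1):
        result = trainer.train()
        print(f"Finished train run #{i}", result)
        if i % checkpoint_every == 0:
            checkpoint = trainer.save()
            print("Last checkpoint", checkpoint)
            checkpoints.append(checkpoint)
    return checkpoints


checkpoints = run_training(FakeTrainer(), num_iterations=5)
```

With a real trainer, the same loop shape applies; only the stand-in class would be replaced.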

You could do for example,“PPO”, config=config, stop={#your stop criteria}, name=“run_name”)

Another option: the log directory name should include the env name, so you could register the env under different names that include the extra info you want for each run.

I don’t really have a stop criterion though. I just want to save the model every x iterations (each iteration takes about 1-4 minutes to train and about 35 minutes to collect enough samples)…

Then you can leave it out and it will train forever.

Another question: if I pause the training and load it again, a new folder is created with the values continuing. Is there a way to merge them into the same training run, so that it shows up as a single color and is continuous?

I want to be able to do something like this:

counter = 1

while True:
     result =“PPO”, config=config, stop={counter % 1 == 0}, name=“run_name”)
     counter += 1

Is that possible with tune?


You could do that, but you do not really need to. There are some checkpointing parameters that you could use.

keep_checkpoints_num (int) – Number of checkpoints to keep. A value of None keeps all checkpoints. Defaults to None. If set, need to provide checkpoint_score_attr.

checkpoint_score_attr (str) – Specifies by which attribute to rank the best checkpoint. Default is increasing order. If attribute starts with min- it will rank attribute in decreasing order, i.e. min-validation_loss.

checkpoint_freq (int) – How many training iterations between checkpoints. A value of 0 (default) disables checkpointing. This has no effect when using the Functional Training API.

checkpoint_at_end (bool) – Whether to checkpoint at the end of the experiment regardless of the checkpoint_freq. Default is False. This has no effect when using the Functional Training API
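To make the four parameters above concrete, here is a small sketch. `checkpoint_kwargs` and `best_checkpoint` are hypothetical helpers, not part of Ray: the first collects the parameters into a keyword-argument dict and enforces the documented rule that `keep_checkpoints_num` needs `checkpoint_score_attr`; the second illustrates the `min-` prefix as I read the description (higher is better by default, lower is better with `min-`). The checkpoint names and scores are made up.

```python
def checkpoint_kwargs(checkpoint_freq=0, checkpoint_at_end=False,
                      keep_checkpoints_num=None, checkpoint_score_attr=None):
    """Collect the checkpointing parameters into one dict (hypothetical helper)."""
    if keep_checkpoints_num is not None and checkpoint_score_attr is None:
        raise ValueError("keep_checkpoints_num requires checkpoint_score_attr")
    return {
        "checkpoint_freq": checkpoint_freq,
        "checkpoint_at_end": checkpoint_at_end,
        "keep_checkpoints_num": keep_checkpoints_num,
        "checkpoint_score_attr": checkpoint_score_attr,
    }


def best_checkpoint(scored, score_attr):
    """Pick the best (path, metrics) pair; a 'min-' prefix means lower is better."""
    if score_attr.startswith("min-"):
        key = score_attr[len("min-"):]
        return min(scored, key=lambda item: item[1][key])[0]
    return max(scored, key=lambda item: item[1][score_attr])[0]


# Checkpoint every iteration and once more at the end, keeping all checkpoints.
kwargs = checkpoint_kwargs(
    checkpoint_freq=1,
    checkpoint_at_end=True,
    checkpoint_score_attr="episode_reward_mean",
)

# Made-up scores, only to show how the ranking attribute is interpreted.
scored = [
    ("ckpt_a", {"episode_reward_mean": 10.0, "validation_loss": 0.5}),
    ("ckpt_b", {"episode_reward_mean": 20.0, "validation_loss": 0.9}),
]
by_reward = best_checkpoint(scored, "episode_reward_mean")
by_loss = best_checkpoint(scored, "min-validation_loss")
```

The resulting `kwargs` dict could then be unpacked into the call that accepts these parameters.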


What version of ray are you currently using?

There is a bug with rnn sequencing in the latest release.

You can avoid it with these settings (assuming you are not trying to train with multi-gpu).

Screenshot because I am remoting into my training machine.
I am using only a single GPU to train on a single machine (different machines for collecting samples, but I don’t think that matters?)

Do I need to set simple_optimizer in my case?

P.S., name=args.checkpoint, keep_checkpoints_num = None, checkpoint_score_attr = "episode_reward_mean", checkpoint_freq = 1, checkpoint_at_end = True)
Would save it in a file under ~ray_tune/args.checkpoints. And if I need to continue training, I would just pass resume=True?


Personally I would until this issue is closed: [Bug] [rllib] RNN sequencing is incorrect · Issue #19976 · ray-project/ray · GitHub.

The simple_optimizer should still be able to use 1 gpu just fine.

That looks good to me.

Resume is for when a training run failed for some reason.

Restore is to continue training by reloading from a user specified checkpoint.
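The distinction can be sketched as a choice between two keyword arguments. `continuation_kwargs` is a hypothetical helper and the checkpoint path is a made-up placeholder; the sketch only assumes the `resume` and `restore` arguments discussed in this thread.

```python
def continuation_kwargs(run_failed, checkpoint_path=None):
    """Choose how to continue a run (hypothetical helper, not part of Ray)."""
    if run_failed:
        # Pick up an interrupted experiment where it left off.
        return {"resume": True}
    if checkpoint_path is None:
        raise ValueError("restore needs a user-specified checkpoint path")
    # Start a run whose state is reloaded from the given checkpoint.
    return {"restore": checkpoint_path}


after_crash = continuation_kwargs(run_failed=True)
from_checkpoint = continuation_kwargs(
    run_failed=False,
    checkpoint_path="/path/to/checkpoint",  # placeholder path
)
```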

DEFAULT_CONFIG = with_common_config({
    # Should use a critic as a baseline (otherwise don't use value baseline;
    # required for using GAE).
    "use_critic": True,
    # If true, use the Generalized Advantage Estimator (GAE)
    # with a value function, see
    "use_gae": True,
    # The GAE (lambda) parameter.
    "lambda": 0.995,
    # Initial coefficient for KL divergence.
    "kl_coeff": 0.2,
    # Size of batches collected from each worker.
    "rollout_fragment_length": 64,
    # Number of timesteps collected for each SGD round. This defines the size
    # of each SGD epoch.
    "train_batch_size": 7168,
    # Total SGD batch size across all devices for SGD. This defines the
    # minibatch size within each epoch.
    "sgd_minibatch_size": 128,
    # Number of SGD iterations in each outer loop (i.e., number of epochs to
    # execute per train batch).
    "num_sgd_iter": 10,
    # Whether to shuffle sequences in the batch when training (recommended).
    "shuffle_sequences": False,
    # Stepsize of SGD.
    "lr": 3e-5,
    # Learning rate schedule.
    "lr_schedule": None,
    # Coefficient of the value function loss. IMPORTANT: you must tune this if
    # you set vf_share_layers=True inside your model's config.
    "vf_loss_coeff": 1.25,
    "model": {
        # Share layers for value function. If you set this to True, it's
        # important to tune vf_loss_coeff.
        "vf_share_layers": False,

        "fcnet_hiddens": [1024, 1024],
        "fcnet_activation": "relu",
        "use_lstm": True,
        "max_seq_len": 16,
        "lstm_cell_size": 512,
        "lstm_use_prev_action": False,
    },
    # Coefficient of the entropy regularizer.
    "entropy_coeff": 0.00005,
    # Decay schedule for the entropy regularizer.
    "entropy_coeff_schedule": None,
    # PPO clip parameter.
    "clip_param": 0.3,
    # Clip param for the value function. Note that this is sensitive to the
    # scale of the rewards. If your expected V is large, increase this.
    "vf_clip_param": 30.0,
    # If specified, clip the global norm of gradients by this amount.
    "grad_clip": None,
    # Target value for KL divergence.
    "kl_target": 0.02,
    # Whether to rollout "complete_episodes" or "truncate_episodes".
    "batch_mode": "complete_episodes",
    # Which observation filter to apply to the observation.
    "observation_filter": "NoFilter",
    # Uses the sync samples optimizer instead of the multi-gpu one. This is
    # usually slower, but you might want to try it if you run into issues with
    # the default optimizer.
    "simple_optimizer": True,
    #"reuse_actors": True,
    "num_gpus": 1,
    # Use the connector server to generate experiences.
    "input": (
        lambda ioctx: PolicyServerInput(ioctx, args.ip, 55556)
    ),
    # Use a single worker process to run the server.
    "num_workers": 0,
    # Disable OPE, since the rollouts are coming from online clients.
    "input_evaluation": [],
    # "callbacks": MyCallbacks,
    "env_config": {"sleep": True},
    "framework": "tf",
    # "eager_tracing": True,
    "explore": True,
    "create_env_on_driver": False,
    "log_sys_usage": False,
    "compress_observations": True,
})

allianceId = 27
heroId = 72
localHeroId = 100
itemId = 70
localItemId = 10
x = 8
y = 5
DEFAULT_CONFIG["env_config"]["observation_space"] = ......
DEFAULT_CONFIG["env_config"]["action_space"] = ....


trainer = PPOTrainer(config=DEFAULT_CONFIG, env=RandomEnv), name=args.checkpoint, keep_checkpoints_num = None, checkpoint_score_attr = "episode_reward_mean", checkpoint_freq = 1, checkpoint_at_end = True)

Is that good? Note that I added simple_optimizer to the overall trainer config.


So something is a little off:

@Denys_Ashikhin ,

It looks like you did not provide a Trainable or a registered Trainer. You must specify this either by using an already registered Trainer like

import ray
from ray import tune

    stop={"episode_reward_mean": 200},
        "env": "CartPole-v0",
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),

or by creating a Trainable by using build_trainer() from ray.rllib.agents.trainer_template like:

MyTrainer = build_trainer(

my_trainer = MyTrainer(...)
..., ...)
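The first option (an already registered trainer name) could look roughly like the sketch below. It assumes the call in question is `tune.run`, which is not spelled out in the text above. `build_run_kwargs` and `launch` are hypothetical helper names, and Ray is imported lazily so that the argument-building part runs without it.

```python
def build_run_kwargs():
    """Arguments from the snippet above, without the learning-rate search."""
    return {
        "run_or_experiment": "PPO",  # name of an already registered trainer
        "stop": {"episode_reward_mean": 200},
        "config": {
            "env": "CartPole-v0",
            "num_gpus": 0,
            "num_workers": 1,
        },
    }


def launch():
    """Start the experiment; needs Ray installed (assumes tune.run)."""
    from ray import tune  # lazy import, only needed when actually launching

    kwargs = build_run_kwargs()
    # Try each learning rate once, as in the snippet above.
    kwargs["config"]["lr"] = tune.grid_search([0.01, 0.001, 0.0001])
    return tune.run(**kwargs)


run_kwargs = build_run_kwargs()
```

Calling `launch()` on a machine with Ray installed would start one trial per learning rate.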

Hope this helps


Thanks everyone, seems like I managed to get it working with everyone’s help.

Just one question, why is it printing twice?

This is simply mechanical: Tune prints a status update every couple of seconds. If an iteration takes longer, you get more prints (identical in most cases); if it is faster, you get fewer. See this answer to my question for more information.

I think you forgot to link the question :sweat_smile: