How do I set up the training configuration of a DQN algorithm so that I can control the number of training steps performed in one iteration of algo.train()?
As in, the number of steps per episode before your environment gets reset and another episode begins?
No, as in the number of steps corresponding to one training iteration, i.e. to one call to algo.train().
You could try something like this:
custom_checkpoint_dir = "./your_directory"
# Result keys to print after each training iteration
print_guide = [
    "episodes_this_iter",
    "episode_reward_mean",
    "episode_reward_max",
    "episode_reward_min",
    "policy_reward_mean",
    "done",
    "num_env_steps_sampled_this_iter",
]

algo = config.build()
total_env_steps = 0
print("BEGINNING TRAINING")
for i in range(1, 1000):
    result = algo.train()
    print(f"Iteration: {i}")
    for key in print_guide:
        print(f"{key}: {result[key]}")
    # Save a checkpoint every 20 iterations
    if i % 20 == 0:
        checkpoint_dir = algo.save(custom_checkpoint_dir).checkpoint.path
        print(f"Checkpoint saved in directory {checkpoint_dir}")
    total_env_steps += result["num_env_steps_sampled_this_iter"]
    print("Total Training Steps:", total_env_steps)
    print("\n")
Or, in your case, stop once a total env-step budget is reached:

total_env_steps = 0
print("BEGINNING TRAINING")
while True:
    result = algo.train()
    for key in print_guide:
        print(f"{key}: {result[key]}")
    print(result)  # full result dict, useful for seeing all available keys
    total_env_steps += result["num_env_steps_sampled_this_iter"]
    print("Total Training Steps:", total_env_steps)
    # Stop once at least 4000 env steps have been sampled in total
    if total_env_steps >= 4000:
        break
because I don't believe you can specify the exact number through the config if your env varies in the number of timesteps per episode. This link here is pretty much the question you are asking.
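For what it's worth, newer RLlib versions do expose a knob for this, but it only sets a lower bound on env steps per train() call, not an exact count, for the reason above. A minimal sketch, assuming Ray 2.x's DQNConfig and its reporting() settings:

from ray.rllib.algorithms.dqn import DQNConfig

# Assumption: Ray 2.x API. min_sample_timesteps_per_iteration sets a
# *minimum* number of env steps sampled per algo.train() call; the actual
# count can overshoot when episodes don't end exactly on the boundary.
config = (
    DQNConfig()
    .environment("CartPole-v1")
    .reporting(min_sample_timesteps_per_iteration=1000)
)
algo = config.build()
result = algo.train()
print(result["num_env_steps_sampled_this_iter"])  # >= 1000, not exactly 1000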
Now, if you use Ray Tune, you can specify how many timesteps the training runs with something like this:
import os
from datetime import date
import ray
from ray import air, tune
from ray.rllib.algorithms.ppo import PPO
from ray.rllib.utils.test_utils import check_learning_achieved

# Stop as soon as either criterion is met
stop = {
    "episode_reward_mean": args.stop_reward,
    "timesteps_total": args.stop_timesteps,
}
# Create a path to a RayResults folder on the desktop (Windows-only)
desktop_path = os.path.join(os.environ["USERPROFILE"], "Desktop")
ray_results_path = os.path.join(desktop_path, "RayResults")

results = tune.Tuner(
    PPO,
    run_config=air.RunConfig(storage_path=ray_results_path,
                             name=f"Ray_{date.today()}", stop=stop, verbose=1),
    param_space=config,
).fit()
if args.run_as_test:
    check_learning_achieved(results, args.stop_reward)
ray.shutdown()
where you can define the total timesteps in the args.
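The snippet assumes an argparse-style args object; a minimal sketch of what that might look like (the flag names mirror the attributes used above, and the default values are just placeholders):

import argparse

# Hypothetical argparse setup matching the args.* attributes used above;
# the defaults here are placeholders, not recommendations.
parser = argparse.ArgumentParser()
parser.add_argument("--stop-reward", type=float, default=150.0)
parser.add_argument("--stop-timesteps", type=int, default=100000)
parser.add_argument("--run-as-test", action="store_true")
args = parser.parse_args()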