ValueError: Must pass in RNN state batches for placeholders [<tf.Tensor 'default_policy/Placeholder:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/Placeholder_1:0' shape=(?, 256) dtype=float32>], got []

Hello Ray Team,

After I train a PPO model (with use_lstm == True) and try to evaluate it using the compute_actions or compute_single_action method, I get the following error:

ValueError: Must pass in RNN state batches for placeholders [<tf.Tensor 'default_policy/Placeholder:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/Placeholder_1:0' shape=(?, 256) dtype=float32>], got []

I'm not sure how to pass the required state argument to compute_actions or compute_single_action. Normally, when I use PPO without an LSTM, I simply call it like this:

import numpy as np

current_state = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
action = policy.compute_single_action(current_state, state=[])

My Environment
OS: Windows 10
Python: 3.7.4
Tensorflow: 2.1.0
Numpy: 1.18.5
Ray: 1.0.0

My PPO Config:
{'num_workers': 2,
'num_envs_per_worker': 1,
'rollout_fragment_length': 200,
'batch_mode': 'truncate_episodes',
'num_gpus': 1,
'train_batch_size': 5000,
'model': {'fcnet_hiddens': [256, 256],
'fcnet_activation': 'elu',
'conv_filters': None,
'conv_activation': 'elu',
'free_log_std': False,
'no_final_linear': False,
'vf_share_layers': True,
'use_lstm': True,
'max_seq_len': 20,
'lstm_cell_size': 256,
'lstm_use_prev_action_reward': False,
'_time_major': False,
'framestack': False,
'dim': 84,
'grayscale': False,
'zero_mean': True,
'custom_model': None,
'custom_model_config': {},
'custom_action_dist': None,
'custom_preprocessor': None},
'optimizer': {},
'gamma': 0.99,
'horizon': None,
'soft_horizon': False,
'no_done_at_end': False,
'env_config': {},
'env': 'SimpleSupplyChain',
'normalize_actions': False,
'clip_rewards': True,
'clip_actions': True,
'preprocessor_pref': 'deepmind',
'lr': 5e-05,
'monitor': False,
'log_level': 'WARN',
'callbacks': ray.rllib.agents.callbacks.DefaultCallbacks,
'ignore_worker_failures': False,
'log_sys_usage': True,
'fake_sampler': False,
'framework': 'tf',
'eager_tracing': False,
'no_eager_on_workers': False,
'explore': True,
'exploration_config': {'type': 'StochasticSampling'},
'evaluation_interval': None,
'evaluation_num_episodes': 10,
'in_evaluation': False,
'evaluation_config': {},
'evaluation_num_workers': 0,
'custom_eval_function': None,
'sample_async': False,
'_use_trajectory_view_api': False,
'observation_filter': 'NoFilter',
'synchronize_filters': True,
'tf_session_args': {'intra_op_parallelism_threads': 8,
'inter_op_parallelism_threads': 8,
'gpu_options': {'allow_growth': True},
'log_device_placement': False,
'device_count': {'CPU': 1},
'allow_soft_placement': True},
'local_tf_session_args': {'intra_op_parallelism_threads': 8,
'inter_op_parallelism_threads': 8},
'compress_observations': False,
'collect_metrics_timeout': 180,
'metrics_smoothing_episodes': 100,
'remote_worker_envs': False,
'remote_env_batch_wait_ms': 0,
'min_iter_time_s': 0,
'timesteps_per_iteration': 0,
'seed': None,
'extra_python_environs_for_driver': {},
'extra_python_environs_for_worker': {},
'num_cpus_per_worker': 1,
'num_gpus_per_worker': 0,
'custom_resources_per_worker': {},
'num_cpus_for_driver': 1,
'memory': 0,
'object_store_memory': 0,
'memory_per_worker': 0,
'object_store_memory_per_worker': 0,
'input': 'sampler',
'input_evaluation': ['is', 'wis'],
'postprocess_inputs': False,
'shuffle_buffer_size': 0,
'output': None,
'output_compress_columns': ['obs', 'new_obs'],
'output_max_file_size': 67108864,
'multiagent': {'policies': {},
'policy_mapping_fn': None,
'policies_to_train': None,
'observation_fn': None,
'replay_mode': 'independent'},
'logger_config': None,
'replay_sequence_length': 1,
'use_critic': True,
'use_gae': True,
'lambda': 0.95,
'kl_coeff': 0.5,
'sgd_minibatch_size': 500,
'shuffle_sequences': True,
'num_sgd_iter': 10,
'lr_schedule': None,
'vf_share_layers': True,
'vf_loss_coeff': 1.0,
'entropy_coeff': 0.01,
'entropy_coeff_schedule': None,
'clip_param': 0.1,
'vf_clip_param': 1000000,
'grad_clip': None,
'kl_target': 0.01,
'simple_optimizer': False,
'_fake_gpus': False,
'worker_index': 0}

My Model Summary:

How do I pass the state parameter to the compute_actions and compute_single_action methods?
Do I need to reshape my input? If so, what shape should it be?

P.S. I created a custom supply chain environment in Gym.

Thank you
Pond

Hi @powxoper,
You want something like this:

current_obs = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
state = policy.get_initial_state()
action, state = policy.compute_single_action(current_obs, state=state)
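
For reference, a quick sanity check (assuming the built-in use_lstm wrapper rather than a custom RNN model): get_initial_state() should return one zero vector per RNN placeholder, each of length lstm_cell_size, which is why the error message lists two (?, 256) placeholders for your lstm_cell_size of 256.

state = policy.get_initial_state()
# Expected for the default LSTM wrapper: a list of two zero arrays,
# one for the LSTM hidden state and one for the cell state.
print(len(state))        # 2  -> matches the two placeholders in the error
print(state[0].shape)    # (256,) -> matches lstm_cell_size in the config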

Thank you @mannyv for your reply.

After I tried that, I got this error:

What should I do next?

Oops, I forgot to include info. Sorry about that. How about this:

current_obs = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
state = policy.get_initial_state()
action, state, info = policy.compute_single_action(current_obs, state=state)

Thank you very much @mannyv

I finally got the result.

@mannyv

I have a question about the next action.
After the initial state, for the next action I only change the current obs according to the changing environment, right? And for the state, do I keep the returned state and pass it back into the method, or do I need to call state = policy.get_initial_state() every time?

My environment is (focusing on only one warehouse):

my_obs (changes every day) = [inventory_of_factory, inventory_of_warehouse, demand_14th_timestep, demand_13th_timestep, demand_12th_timestep, …, demand_1st_timestep]

My actions = [#produce@factory, #deliver_to_warehouse]

Are you trying to evaluate a trained policy, or to train your policy?

The way you are doing it now is fine for evaluation and debugging, but it will not train the policy.

In general, you (or the library, if you are calling trainer.train() or tune.run()) will call get_initial_state() every time you call env.reset(), which you do when your environment returns done == True. On every step of the environment, you use the state returned from the previous compute_single_action call as the input state for the current call.
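
To make that concrete, here is a minimal evaluation-loop sketch. The names trainer and env are placeholders for your own trained Trainer and Gym environment, and it assumes your flat Box observation needs no extra preprocessing:

policy = trainer.get_policy()        # trained PPO policy
obs = env.reset()
state = policy.get_initial_state()   # fresh RNN state at every episode start
done = False
episode_reward = 0.0

while not done:
    # Feed the state returned by the previous call back in on the next step.
    action, state, _ = policy.compute_single_action(obs, state=state)
    obs, reward, done, _ = env.step(action)
    episode_reward += reward

print("episode reward:", episode_reward)

When the episode ends (done == True), reset the environment and call get_initial_state() again before starting the next episode.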


I get it. Thank you very much @mannyv