ValueError: Must pass in RNN state batches for placeholders [<tf.Tensor 'default_policy/Placeholder:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/Placeholder_1:0' shape=(?, 256) dtype=float32>], got []

Hello Ray Team,

After I train a PPO model (with use_lstm == True) and try to evaluate it using the compute_actions or compute_single_action method, I get the following error:

ValueError: Must pass in RNN state batches for placeholders [<tf.Tensor 'default_policy/Placeholder:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'default_policy/Placeholder_1:0' shape=(?, 256) dtype=float32>], got []

I'm not sure how to pass the required state argument to compute_actions or compute_single_action. Normally, when I use PPO without an LSTM, I simply call it like this:

import numpy as np

current_state = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
action = policy.compute_single_action(current_state, state=[])

My Environment
OS: Windows 10
Python: 3.7.4
Tensorflow: 2.1.0
Numpy: 1.18.5
Ray: 1.0.0

My PPO Config:
{'num_workers': 2,
'num_envs_per_worker': 1,
'rollout_fragment_length': 200,
'batch_mode': 'truncate_episodes',
'num_gpus': 1,
'train_batch_size': 5000,
'model': {'fcnet_hiddens': [256, 256],
'fcnet_activation': 'elu',
'conv_filters': None,
'conv_activation': 'elu',
'free_log_std': False,
'no_final_linear': False,
'vf_share_layers': True,
'use_lstm': True,
'max_seq_len': 20,
'lstm_cell_size': 256,
'lstm_use_prev_action_reward': False,
'_time_major': False,
'framestack': False,
'dim': 84,
'grayscale': False,
'zero_mean': True,
'custom_model': None,
'custom_model_config': {},
'custom_action_dist': None,
'custom_preprocessor': None},
'optimizer': {},
'gamma': 0.99,
'horizon': None,
'soft_horizon': False,
'no_done_at_end': False,
'env_config': {},
'env': 'SimpleSupplyChain',
'normalize_actions': False,
'clip_rewards': True,
'clip_actions': True,
'preprocessor_pref': 'deepmind',
'lr': 5e-05,
'monitor': False,
'log_level': 'WARN',
'callbacks': ray.rllib.agents.callbacks.DefaultCallbacks,
'ignore_worker_failures': False,
'log_sys_usage': True,
'fake_sampler': False,
'framework': 'tf',
'eager_tracing': False,
'no_eager_on_workers': False,
'explore': True,
'exploration_config': {'type': 'StochasticSampling'},
'evaluation_interval': None,
'evaluation_num_episodes': 10,
'in_evaluation': False,
'evaluation_config': {},
'evaluation_num_workers': 0,
'custom_eval_function': None,
'sample_async': False,
'_use_trajectory_view_api': False,
'observation_filter': 'NoFilter',
'synchronize_filters': True,
'tf_session_args': {'intra_op_parallelism_threads': 8,
'inter_op_parallelism_threads': 8,
'gpu_options': {'allow_growth': True},
'log_device_placement': False,
'device_count': {'CPU': 1},
'allow_soft_placement': True},
'local_tf_session_args': {'intra_op_parallelism_threads': 8,
'inter_op_parallelism_threads': 8},
'compress_observations': False,
'collect_metrics_timeout': 180,
'metrics_smoothing_episodes': 100,
'remote_worker_envs': False,
'remote_env_batch_wait_ms': 0,
'min_iter_time_s': 0,
'timesteps_per_iteration': 0,
'seed': None,
'extra_python_environs_for_driver': {},
'extra_python_environs_for_worker': {},
'num_cpus_per_worker': 1,
'num_gpus_per_worker': 0,
'custom_resources_per_worker': {},
'num_cpus_for_driver': 1,
'memory': 0,
'object_store_memory': 0,
'memory_per_worker': 0,
'object_store_memory_per_worker': 0,
'input': 'sampler',
'input_evaluation': ['is', 'wis'],
'postprocess_inputs': False,
'shuffle_buffer_size': 0,
'output': None,
'output_compress_columns': ['obs', 'new_obs'],
'output_max_file_size': 67108864,
'multiagent': {'policies': {},
'policy_mapping_fn': None,
'policies_to_train': None,
'observation_fn': None,
'replay_mode': 'independent'},
'logger_config': None,
'replay_sequence_length': 1,
'use_critic': True,
'use_gae': True,
'lambda': 0.95,
'kl_coeff': 0.5,
'sgd_minibatch_size': 500,
'shuffle_sequences': True,
'num_sgd_iter': 10,
'lr_schedule': None,
'vf_share_layers': True,
'vf_loss_coeff': 1.0,
'entropy_coeff': 0.01,
'entropy_coeff_schedule': None,
'clip_param': 0.1,
'vf_clip_param': 1000000,
'grad_clip': None,
'kl_target': 0.01,
'simple_optimizer': False,
'_fake_gpus': False,
'worker_index': 0}

My Model Summary:

How do I pass the state parameter to the compute_actions and compute_single_action methods?
Do I need to reshape my input? If so, what shape should it be?

P.S. I created a custom supply chain environment in Gym.

Thank you
Pond

Hi @powxoper,
You want something like this:

current_obs = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
state = policy.get_initial_state()
action, state = policy.compute_single_action(current_obs, state=state)
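
For reference, a quick sanity check (assuming the built-in use_lstm wrapper rather than a custom RNN model): get_initial_state() should return one zero vector per RNN placeholder, each of length lstm_cell_size, which is why the error message lists two (?, 256) placeholders for your lstm_cell_size of 256.

state = policy.get_initial_state()
# Expected for the default LSTM wrapper: a list of two zero arrays,
# one for the LSTM hidden state and one for the cell state.
print(len(state))        # 2  -> matches the two placeholders in the error
print(state[0].shape)    # (256,) -> matches lstm_cell_size in the config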

Thank you @mannyv for your reply.

After I tried that, I got this error:

What should I do next?

Oops, I forgot to include info. Sorry about that. How about this:

current_obs = np.array([10., 5., 1., 2., 4., 2., 1., 5., 3., 5., 3., 4., 1., 5., 4., 4., 0.])
state = policy.get_initial_state()
action, state, info = policy.compute_single_action(current_obs, state=state)

Thank you very much @mannyv

I finally got the result.

@mannyv

I have a question about the next action.
After the initial state, for the next action I only change the current obs according to the changing environment, right? And for the state, do I keep the returned state and pass it back into the method, or do I need to call state = policy.get_initial_state() every time?

My environment is (focusing on only one warehouse):

my_obs (changes every day) = [inventory_of_factory, inventory_of_warehouse, demand_14th_timestep, demand_13th_timestep, demand_12th_timestep, …, demand_1st_timestep]

My actions = [#produce@factory, #deliver_to_warehouse]

Are you trying to evaluate a trained policy, or to train your policy?

The way you are doing it now is fine for evaluation and debugging, but it will not train the policy.

In general, you (or the library, if you are calling trainer.train() or tune.run()) will call get_initial_state() every time you call env.reset(), which you do when your environment returns done == True. On every step of the environment, you use the state returned from the previous compute_single_action call as the input state for the current call.
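
To make that concrete, here is a minimal evaluation-loop sketch. The names trainer and env are placeholders for your own trained Trainer and Gym environment, and it assumes your flat Box observation needs no extra preprocessing:

policy = trainer.get_policy()        # trained PPO policy
obs = env.reset()
state = policy.get_initial_state()   # fresh RNN state at every episode start
done = False
episode_reward = 0.0

while not done:
    # Feed the state returned by the previous call back in on the next step.
    action, state, _ = policy.compute_single_action(obs, state=state)
    obs, reward, done, _ = env.step(action)
    episode_reward += reward

print("episode reward:", episode_reward)

When the episode ends (done == True), reset the environment and call get_initial_state() again before starting the next episode.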


I get it. Thank you very much @mannyv