ValueError: Expected parameter logits (...) to satisfy the constraint IndependentConstraint(Real(), 1)

@LukasNothhelfer ,

looks like you are making some progress here. I would also check whether the gradients for the weights become NaN because the weights themselves already are NaN (a quick check is sketched after the list). Furthermore,

  • What happens if you reduce the learning rate?
  • Is the learning rate computed somewhere and has it become NaN itself?
  • How does the loss behave? Is it growing, or becoming NaN?
  • If your loss contains a log and your initial input values are zero, that could cause a NaN value in the loss and later in the gradient.
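
A minimal sketch of such a check, assuming plain PyTorch and a hypothetical model and loss from your training step; call it after loss.backward() (and again after optimizer.step()) to see whether the loss, the gradients, or the weights go non-finite first:

import torch

def report_nans(model, loss):
    # a NaN/Inf loss will poison every gradient downstream
    print("loss finite:", torch.isfinite(loss).all().item())
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"{name}: weights contain NaN/Inf")
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"{name}: gradients contain NaN/Inf")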

Good luck

@Lars_Simon_Zehnder
My network is as follows: lin1 → lin2 → lstm. It looks like there is a problem with the linear layers. The gradients of the linear layers become NaN. The gradients of the lstm are fine:

pi.bias has an infinite grad, loss is [tensor(0.4693, grad_fn=<DivBackward0>)], grads are:
{
    "lin1.weight": {
        "max": NaN,
        "min": NaN,
        "mean": NaN,
        "std": NaN,
        "isfinite": false
    },
    "lin1.bias": {
        "max": NaN,
        "min": NaN,
        "mean": NaN,
        "std": NaN,
        "isfinite": false
    },
    "lin2.weight": {
        "max": NaN,
        "min": NaN,
        "mean": NaN,
        "std": NaN,
        "isfinite": false
    },
    "lin2.bias": {
        "max": NaN,
        "min": NaN,
        "mean": NaN,
        "std": NaN,
        "isfinite": false
    },
    "lstm.weight_ih_l0": {
        "max": 0.15106573700904846,
        "min": -0.08007017523050308,
        "mean": -0.00042382298852317035,
        "std": 0.007156638894230127,
        "isfinite": true
    },
    "lstm.weight_hh_l0": {
        "max": 0.014247420243918896,
        "min": -0.024576563388109207,
        "mean": 3.1564351957058534e-05,
        "std": 0.001239124103449285,
        "isfinite": true
    },
    "lstm.bias_ih_l0": {
        "max": 0.031573809683322906,
        "min": -0.01593855395913124,
        "mean": -0.0006381386774592102,
        "std": 0.004488547332584858,
        "isfinite": true
    },
    "lstm.bias_hh_l0": {
        "max": 0.031573813408613205,
        "min": -0.01593855582177639,
        "mean": -0.0006381386192515492,
        "std": 0.004488547332584858,
        "isfinite": true
    },
    "vf.weight": {
        "max": 0.038946326822042465,
        "min": -0.12820599973201752,
        "mean": -0.012392022646963596,
        "std": 0.040242999792099,
        "isfinite": true
    },
    "vf.bias": {
        "max": 0.09843102842569351,
        "min": 0.09843102842569351,
        "mean": 0.09843102842569351,
        "std": NaN,
        "isfinite": true
    },
    "pi.weight": {
        "max": 0.07048966735601425,
        "min": -0.0619414821267128,
        "mean": 3.055902197957039e-10,
        "std": 0.013486070558428764,
        "isfinite": true
    },
    "pi.bias": {
        "max": 0.07905567437410355,
        "min": -0.07213170826435089,
        "mean": 3.1044086745701804e-10,
        "std": 0.0533791221678257,
        "isfinite": true
    }
}

I guess I have to hook into the forward and backward pass of autograd to see what is going on.
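
One way to do that without wiring up hooks by hand is torch.autograd.set_detect_anomaly(True); a more targeted sketch with module hooks (plain PyTorch, a hypothetical model; register_full_backward_hook needs PyTorch 1.8 or newer) could look like this:

import torch

def attach_nan_hooks(model):
    # flag a module whose forward output turns non-finite
    def fwd_hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"forward NaN/Inf after {module.__class__.__name__}")

    # flag a module that receives a non-finite gradient on the way back
    def bwd_hook(module, grad_input, grad_output):
        for g in grad_output:
            if g is not None and not torch.isfinite(g).all():
                print(f"backward NaN/Inf flowing into {module.__class__.__name__}")

    for module in model.modules():
        module.register_forward_hook(fwd_hook)
        module.register_full_backward_hook(bwd_hook)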

@LukasNothhelfer,

How long is your max_seq_len?

Can you test with a max_seq_len of 1?


@mannyv
My episodes are always the same length (episode_length=55). I used max_seq_len = 20 (the default) the whole time. To be honest, I still don't really understand what impact max_seq_len has on learning and training, or why it is there at all. I'll try with max_seq_len=1 and let you know.

max_seq_len is RLlib's method for implementing truncated BPTT (backpropagation through time).
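
A rough illustration of what that means for a 55-step episode with max_seq_len=20 (this is only a sketch of the idea, not RLlib's actual sequencing code): the episode is cut into chunks of at most 20 steps, gradients only flow within a chunk, and a shorter final chunk is typically zero-padded up to max_seq_len.

episode_len, max_seq_len = 55, 20

# (start, end) indices of the truncated-BPTT chunks
chunks = [(start, min(start + max_seq_len, episode_len))
          for start in range(0, episode_len, max_seq_len)]
print(chunks)  # [(0, 20), (20, 40), (40, 55)]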


@mannyv Looks like it runs without errors now when I use max_seq_len=1. I just don't quite understand why. An LSTM is designed precisely to deal with the gradient problems of BPTT, at least with vanishing gradients. So I have to assume that the gradients for my linear layers simply exploded. I guess I'll have to look into max_seq_len and truncated BPTT a bit more. I'm not sure right now whether I can leave max_seq_len at 1. Can you tell me what negative or positive impact it would have in general if I left max_seq_len at 1?


@LukasNothhelfer ,

following the description in catalog.py, max_seq_len is the maximum sequence length for the LSTM, meaning that all sequences are fit to this size.

@Lars_Simon_Zehnder Yes. And it looks like this parameter was causing the crash. I would like to know whether I can leave it at 1 and what that means for my agent/learning/training progress if I leave it there.

@LukasNothhelfer ,

from my understanding, and from looking into rnn_sequencing.py, this is the sequence length of the LSTM. So max_seq_len=1 leads to a batch with a time dimension of 1 being fed into your LSTM (see the sketch below).
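
A plain-PyTorch illustration of that reshape (not RLlib code; the sizes are made up): RLlib pads/chops the flat sample batch of shape [B*T, F] into [B, T, F] with T = max_seq_len before it reaches the recurrent layer.

import torch

rows, feat = 80, 16                     # e.g. minibatch rows and feature size
max_seq_len = 20
flat = torch.randn(rows, feat)          # [B*T, F] as collected by the sampler

b = rows // max_seq_len                 # number of sequences in the minibatch
seq_batch = flat.reshape(b, max_seq_len, feat)
print(seq_batch.shape)                  # torch.Size([4, 20, 16])

# with max_seq_len=1 the same data becomes [80, 1, 16]:
# every "sequence" is a single timestep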

Hope this helps

Hi @LukasNothhelfer,

I did not consider that test to be a solution. It was a hunch I had about where the NaNs were coming from.

A max_seq_len of 1 means the LSTM would only learn over one timestep at a time. If that is the case, why bother with an LSTM?

There are three possible causes I would consider: the data from the environment (observations and rewards), the model architecture, or the hyperparameters selected by Tune.

What I would do at this point is take the setup you have and change the environment. There is a test environment in RLlib called StatelessCartPole. Try running your current config with that environment swapped in and see whether you still get NaNs. Many of us have used that environment to test our models with max_seq_len values larger than 20.
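
A minimal sketch of that swap, assuming the Ray 1.x module path for the example environment (the import location and exact class name may differ between versions); the model and PPO settings here are placeholders for the existing config:

import ray
from ray import tune
from ray.rllib.examples.env.stateless_cartpole import StatelessCartPole  # path as of Ray 1.x

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 50},
    config={
        "env": StatelessCartPole,           # only the environment is swapped
        "framework": "torch",
        "num_workers": 0,
        "model": {
            "custom_model": "MarketModel",  # keep the existing custom model
            "vf_share_layers": True,
            "max_seq_len": 20,
        },
        # ... keep the rest of the original PPO hyperparameters ...
    },
)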

Out of curiosity, can you share the config that is causing the NaNs?

Good luck

@mannyv Is it really right that the LSTM then makes no sense? The RNN states are still passed on by RLlib from time step to time step and influence the outputs of the network. The project I am working on is a bit more complex, but I have tried to attach the main contents of my configuration below:

def make(file):
	
	data = pickle.load(open(file, "rb"))
	num_timesteps_per_day = data["drl"]["num-timesteps-per-day"]
	num_days = data["drl"]["num-days"]["train"]
	del data
	train_batch_size = 20 * num_timesteps_per_day  # 20 days (approx. 1 month)
	sgd_mini_batch_size = 1 * num_timesteps_per_day  # 1 day
	num_timesteps_total = num_days * num_timesteps_per_day
	num_train_iter_per_epoch = int(num_timesteps_total / train_batch_size)

	fixed_hparams = [
		
		# Resources
		("framework", "torch"), ("log_level", "WARN"),
		("num_gpus", 0),  # no GPUs for RL
		("num_workers", 0),  # driver only
		("num_envs_per_worker", 1), 
		("num_cpus_per_worker", 0),
		("num_gpus_per_worker", 0),
		("custom_resources_per_worker", {}),
		("evaluation_num_workers", 0),  
		("num_cpus_for_driver", 1),  
		("create_env_on_driver", True),  
		
		("metrics_smoothing_episodes", 1),
		("batch_mode", "complete_episodes"),	
		("train_batch_size", train_batch_size),
		
		("sgd_minibatch_size", sgd_mini_batch_size),
		("num_sgd_iter", 30),
		
		("env", MarketEnvironment),
		(("env_config", "file"), os.path.join(
			"/workspace/data/final/", os.path.basename(file))),
		(("env_config", "mode"), "train"),
		
		("evaluation_interval", num_train_iter_per_epoch),
		("evaluation_num_episodes", 0), # siehe: custom_eval_function_ppo
		("custom_eval_function", custom_eval_function_ppo),
		("evaluation_config", {}),
		
		("callbacks", CustomMetricsCallback),
		(("model", "custom_model"), "MarketModel"),
		(("model", "vf_share_layers"), True),
		

		#(("model", "max_seq_len"), 20), # This gives me NaNs
		(("model", "max_seq_len"), 1), # This works
		("simple_optimizer", True),

		("grad_clip", 10.0)
	]
	
	
	variable_hparams = [
		
		# Seeds
		("seed", [0, 1, 2, 3]),  # 4 verschiedene pro Konfiguration
		
		(("env_config", "use_noop_action"), [False]),  # True, False
		(("env_config", "feature_normalization"), ["minmax"]),  # z, minmax
		(("env_config", "shuffle_episodes"), [True]), # True|False
		(("env_config", "features"), [

			# Standard features
			[
				"mean",
				"open",
				"high",
				"low",
				"close-ask",
				"close-bid",
				"tickvol",
				"ma15",
				"ma30",
				"ma60",
				"ema15",
				"ema30",
				"ema60",
				"atr7",
				"atr14",
				"atr28",
				"rocp5",
				"rocp10",
				"rocp20",
				"macd6/13/5",
				"macd12/26/9",
				"macd24/52/18"
			],
			
			
		]),
		
		# Hyperparameters for the model
		(("model", "custom_model_config", "model"), [
			
			# 16-32-32
			{
				"first_hidden_size": 16,
				"second_hidden_size": 32,
				"lstm_hidden_size": 32,
				"dropout_p": 0.05
			},
			{
				"first_hidden_size": 16,
				"second_hidden_size": 32,
				"lstm_hidden_size": 32,
				"dropout_p": 0.5
			},

			# 16-32-64
			{
				"first_hidden_size": 16,
				"second_hidden_size": 32,
				"lstm_hidden_size": 64,
				"dropout_p": 0.05
			},
			{
				"first_hidden_size": 16,
				"second_hidden_size": 32,
				"lstm_hidden_size": 64,
				"dropout_p": 0.5
			},
			
			# 32-32-64
			{
				"first_hidden_size": 32,
				"second_hidden_size": 32,
				"lstm_hidden_size": 64,
				"dropout_p": 0.05
			},
			{
				"first_hidden_size": 32,
				"second_hidden_size": 32,
				"lstm_hidden_size": 64,
				"dropout_p": 0.5
			},

			
			]),
		
		# Time2Vec
		("time2vec", [

			{
				"use_time2vec": False,
				"k": 16,
				"time_normalization": "z",
				"reference": "begin-week"
			}
		]),
		
		("optimizer", [{}]),
		("gamma", [0.8, 0.99]),
		("lambda", [0.6, 1.0]),
		("kl_coeff", [0.2]),
		("kl_target", [0.01]),
		("entropy_coeff", [0.0, 0.01]),
		("lr", [5e-5, 5e-4]),
		("vf_loss_coeff", [1.0, 1.5]),
		("clip_param", [0.3]),
		("vf_clip_param", [15.0]),
	]
	
	meta = {
		"num_train_iter_per_epoch": num_train_iter_per_epoch
	}
	
	return fixed_hparams, variable_hparams, meta

@LukasNothhelfer ,

what @mannyv means is that an LSTM with a max_seq_len of 1 does not make sense, as in this case the LSTM receives a sequence of length 1 - a single input instead of a sequence.

Try setting a breakpoint at the line where new_shape is defined and see how the batch fed into the LSTM is shaped. It will have a time dimension of 1, I guess.

@Lars_Simon_Zehnder
I don't see any reason why the LSTM doesn't make sense, since the output of the network depends, among other things, on the hidden states, which are generated either way, whether I put in a sequence or just a single time step. Of course, it would make no sense if the initial_state were always used for the first time step of a sequence, but this is not the case, since it is only used to start the episode. PyTorch even has a dedicated LSTM module specifically for handling a single time step: LSTMCell.
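
For reference, a minimal sketch of stepping an nn.LSTMCell one timestep at a time while carrying the hidden state forward (plain PyTorch, made-up sizes):

import torch
import torch.nn as nn

cell = nn.LSTMCell(input_size=16, hidden_size=32)
h = torch.zeros(1, 32)   # hidden state, carried over from the previous step
c = torch.zeros(1, 32)   # cell state

for t in range(5):                  # one observation per step
    x_t = torch.randn(1, 16)
    h, c = cell(x_t, (h, c))        # the output depends on the carried state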

Hi @LukasNothhelfer,

Wow there is a lot there. Finding alpha requires so much work =).

What I was asking is: do you have the specific configuration that is generating the NaNs?

Do you have a sense of whether it happens for many configurations or just one or two of the set you have available to choose from?

@mannyv The completely confusing thing is that it has worked all this time and the error has only been showing up for about two weeks. OK, below is a single configuration that gives me NaNs (I deleted some of my environment parameters since they are unimportant here):

{'seed': 0,
 'model': {'custom_model_config': {'model': {'first_hidden_size': 16,
    'second_hidden_size': 32,
    'lstm_hidden_size': 32,
    'dropout_p': 0.05},
   'time2vec': {'use_time2vec': False,
    'k': 16,
    'time_normalization': 'z',
    'reference': 'begin-week'}},
  'custom_model': 'MarketModel',
  'vf_share_layers': True,
  'max_seq_len': 20},
 'optimizer': {},
 'gamma': 0.8,
 'lambda': 0.6,
 'kl_coeff': 0.2,
 'kl_target': 0.01,
 'entropy_coeff': 0.0,
 'lr': 5e-05,
 'vf_loss_coeff': 1.0,
 'clip_param': 0.3,
 'vf_clip_param': 15.0,
 'framework': 'torch',
 'log_level': 'WARN',
 'num_gpus': 0,
 'num_workers': 0,
 'num_envs_per_worker': 1,
 'num_cpus_per_worker': 0,
 'num_gpus_per_worker': 0,
 'custom_resources_per_worker': {},
 'evaluation_num_workers': 0,
 'num_cpus_for_driver': 1,
 'create_env_on_driver': True,
 'metrics_smoothing_episodes': 1,
 'batch_mode': 'complete_episodes',
 'train_batch_size': 1120,
 'sgd_minibatch_size': 56,
 'num_sgd_iter': 30,
 'env': environment.MarketEnvironment,
 'evaluation_interval': 20,
 'evaluation_num_episodes': 0,
 'custom_eval_function': <function evaluation.custom_eval_function_ppo(trainer: ray.rllib.agents.trainer.Trainer, eval_workers: ray.rllib.evaluation.worker_set.WorkerSet)>,
 'evaluation_config': {},
 'callbacks': callback.CustomMetricsCallback,
 'simple_optimizer': True,
 'grad_clip': 10.0,
 'rollout_fragment_length': 1120,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'observation_space': None,
 'action_space': None,
 'remote_worker_envs': False,
 'remote_env_batch_wait_ms': 0,
 'env_task_fn': None,
 'render_env': False,
 'record_env': False,
 'clip_rewards': None,
 'normalize_actions': True,
 'clip_actions': False,
 'preprocessor_pref': 'deepmind',
 'ignore_worker_failures': False,
 'log_sys_usage': True,
 'fake_sampler': False,
 'eager_tracing': False,
 'explore': True,
 'exploration_config': {'type': 'StochasticSampling'},
 'evaluation_parallel_to_training': False,
 'in_evaluation': False,
 'sample_async': False,
 'sample_collector': ray.rllib.evaluation.collectors.simple_list_collector.SimpleListCollector,
 'observation_filter': 'NoFilter',
 'synchronize_filters': True,
 'tf_session_args': {'intra_op_parallelism_threads': 2,
  'inter_op_parallelism_threads': 2,
  'gpu_options': {'allow_growth': True},
  'log_device_placement': False,
  'device_count': {'CPU': 1},
  'allow_soft_placement': True},
 'local_tf_session_args': {'intra_op_parallelism_threads': 8,
  'inter_op_parallelism_threads': 8},
 'compress_observations': False,
 'collect_metrics_timeout': 180,
 'min_iter_time_s': 0,
 'timesteps_per_iteration': 0,
 'extra_python_environs_for_driver': {},
 'extra_python_environs_for_worker': {},
 '_fake_gpus': False,
 'placement_strategy': 'PACK',
 'input': 'sampler',
 'input_config': {},
 'actions_in_input_normalized': False,
 'input_evaluation': ['is', 'wis'],
 'postprocess_inputs': False,
 'shuffle_buffer_size': 0,
 'output': None,
 'output_compress_columns': ['obs', 'new_obs'],
 'output_max_file_size': 67108864,
 'multiagent': {'policies': {},
  'policy_map_capacity': 100,
  'policy_map_cache': None,
  'policy_mapping_fn': None,
  'policies_to_train': None,
  'observation_fn': None,
  'replay_mode': 'independent',
  'count_steps_by': 'env_steps'},
 'logger_config': None,
 '_tf_policy_handles_more_than_one_loss': False,
 '_disable_preprocessor_api': False,
 'monitor': -1,
 'use_critic': True,
 'use_gae': True,
 'shuffle_sequences': True,
 'lr_schedule': None,
 'entropy_coeff_schedule': None,
 'vf_share_layers': -1}

@mannyv Did it. I took my setup exactly as is and just changed the environment to the StatelessCartPole environment (and deleted my own eval function, etc.). I didn't change anything that affects the training cycle, I just made the code executable. Same problem as before: I get NaNs:

Failure # 1 (occurred at 2021-11-18_01-37-38)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 890, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 788, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/opt/conda/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/ray/worker.py", line 1625, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::PPO.train_buffered() (pid=95, ip=10.1.8.250, repr=PPO)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/agents/ppo/ppo_torch_policy.py", line 46, in ppo_surrogate_loss
    curr_action_dist = dist_class(logits, model)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 73, in __init__
    logits=self.inputs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributions/categorical.py", line 64, in __init__
    super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributions/distribution.py", line 56, in __init__
    f"Expected parameter {param} "
ValueError: Expected parameter logits (Tensor of shape (80, 2)) of distribution Categorical(logits: torch.Size([80, 2])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [nan, nan],
        [nan, nan]], grad_fn=<SubBackward0>)

The above exception was the direct cause of the following exception:

ray::PPO.train_buffered() (pid=95, ip=10.1.8.250, repr=PPO)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 224, in train_buffered
    result = self.train()
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 682, in train
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 668, in train
    result = Trainable.train(self)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trainable.py", line 283, in train
    result = self.step()
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 206, in step
    step_results = next(self.train_exec_impl)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 756, in __next__
    return next(self.built_iterator)
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 843, in apply_filter
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 783, in apply_foreach
    for item in it:
  File "/opt/conda/lib/python3.7/site-packages/ray/util/iter.py", line 791, in apply_foreach
    result = fn(item)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/execution/train_ops.py", line 69, in __call__
    }, lw, self.num_sgd_iter, self.sgd_minibatch_size, [])
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/utils/sgd.py", line 108, in do_minibatch_sgd
    }, minibatch.count)))[policy_id]
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 958, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 507, in learn_on_batch
    grads, fetches = self.compute_gradients(postprocessed_batch)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/policy_template.py", line 336, in compute_gradients
    return parent_cls.compute_gradients(self, batch)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 21, in wrapper
    return func(self, *a, **k)
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 679, in compute_gradients
    [postprocessed_batch])
  File "/opt/conda/lib/python3.7/site-packages/ray/rllib/policy/torch_policy.py", line 1052, in _multi_gpu_parallel_grad_calc
    raise last_result[0] from last_result[1]
ValueError: Expected parameter logits (Tensor of shape (80, 2)) of distribution Categorical(logits: torch.Size([80, 2])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [nan, nan],
        [nan, nan]], grad_fn=<SubBackward0>)
In tower 0 on device cpu

The configuration is the same as I posted in my previous post.

@LukasNothhelfer @mannyv I also had the same issue, but now it is resolved. The reason is that if the learning rate in the configuration is less than 0.1, it creates this issue. I am still not sure how the learning rate produces the NaNs in the observation tensor. If anyone knows why, please do share the answer; it would be helpful.

Thank you!


I got the same error:

Expected parameter logits (Tensor of shape (344, 6)) of distribution Categorical(logits: torch.Size([344, 6])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        ...,
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan]], device='cuda:0',
       grad_fn=<SubBackward0>)

What is really weird is that I got this error at iteration 1825, not the first iteration. I used an LSTM in PPO.
Does anyone have a solution?

Anyway, I found this thread on the PyTorch forums; maybe it helps:
Categorical distribution returning breaking - PyTorch Forums