Issues when testing the `loss()` method overridden in my custom `Policy` class

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hi, my Python version is 3.7, my Ray version is 2.7.1, my TensorFlow version is 2.11.0, and my operating system is Ubuntu 20.04.

I built a multi-agent environment that inherits from `ray.rllib.env.multi_agent_env.MultiAgentEnv`, in which I set up three types of agents: ‘left’, ‘right’, and ‘tl’. These correspond to two types of vehicles with different tasks and one type of traffic light. I’m using the DQN algorithm, and the action space of each agent type is defined as a `gymnasium.spaces.Box`.
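
For context, the environment is structured roughly as follows (a minimal sketch, not my real environment; the agent ids, observation shapes, and `Box` bounds are placeholders):

    import numpy as np
    from gymnasium.spaces import Box
    from ray.rllib.env.multi_agent_env import MultiAgentEnv

    class MergeEnvSketch(MultiAgentEnv):
        """Placeholder multi-agent env with 'left' / 'right' / 'tl' agents."""

        def __init__(self, config=None):
            super().__init__()
            self._agent_ids = {"left_0", "right_0", "tl_0"}
            # Placeholder spaces; each agent type has its own sizes in my env.
            self.observation_space = Box(-np.inf, np.inf, shape=(10,), dtype=np.float32)
            self.action_space = Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

        def reset(self, *, seed=None, options=None):
            obs = {aid: self.observation_space.sample() for aid in self._agent_ids}
            return obs, {}

        def step(self, action_dict):
            obs = {aid: self.observation_space.sample() for aid in action_dict}
            rewards = {aid: 0.0 for aid in action_dict}
            terminateds = {"__all__": False}
            truncateds = {"__all__": False}
            return obs, rewards, terminateds, truncateds, {}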

For these three agent types, I built three models based on `ray.rllib.models.tf.tf_modelv2.TFModelV2`; each model outputs the Q-value predictions for the actions of its agent type. All three models share the same `forward()` implementation, which returns the Q values predicted by the model. Inside `forward()`, I use `self.base_model.predict_on_batch(inputs)` to generate batch predictions from the underlying Keras model.

The `inputs` above come from the `input_dict` passed to the model’s `forward()` method and, after some preprocessing on my side, match the input shape of my model.
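
For reference, each model looks roughly like the sketch below (simplified from my `left_model.py`; the layer sizes and the preprocessing are placeholders, and `base_model` is the inner Keras model):

    import tensorflow as tf
    from ray.rllib.models.tf.tf_modelv2 import TFModelV2

    class LeftQModelSketch(TFModelV2):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            super().__init__(obs_space, action_space, num_outputs, model_config, name)
            # Plain Keras network mapping observations to per-action Q values.
            obs_in = tf.keras.layers.Input(shape=obs_space.shape, name="obs")
            hidden = tf.keras.layers.Dense(64, activation="relu")(obs_in)
            q_out = tf.keras.layers.Dense(num_outputs, name="q_values")(hidden)
            self.base_model = tf.keras.Model(obs_in, q_out)

        def forward(self, input_dict, state, seq_lens):
            # 'inputs' is input_dict["obs"] after my preprocessing (omitted here).
            inputs = input_dict["obs"]
            q_values = self.base_model.predict_on_batch(inputs)
            return q_values, state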

In addition, in my policy class I override the parent class’s `loss()` method as follows:

    def loss(self, model, dist_class, train_batch):
        actions = train_batch[SampleBatch.ACTIONS]
        rewards = train_batch[SampleBatch.REWARDS]
        dones = train_batch[SampleBatch.DONES]
        next_obs = train_batch[SampleBatch.NEXT_OBS]

        # Compute Q values
        q_values = model(train_batch)
        q_value = tf.reduce_sum(tf.one_hot(actions, depth=q_values.shape[1]) * q_values, axis=1)

        # Compute next Q values
        next_q_values = model({SampleBatch.CUR_OBS: next_obs})
        next_q_value = tf.reduce_max(next_q_values, axis=1)

        # Compute target Q values
        target_q_value = rewards + (1 - dones) * self.config["gamma"] * next_q_value

        # Compute loss
        loss = tf.reduce_mean(tf.square(q_value - target_q_value))

        return loss

So here’s the problem: in the loss, `q_values = model(train_batch)` calls the model class’s `forward()` method, which in turn executes the `self.base_model.predict_on_batch(inputs)` line shown above, and the following error occurs:

ERROR tune_controller.py:1502 -- Trial task failed for trial DQN_merge_env_87df3_00000
Traceback (most recent call last):
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(LookupError): ray::DQN.train() (pid=1690783, ip=198.18.0.1, actor_id=7e918118b4b6f0e578af4f9301000000, repr=DQN)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 397, in train
    result = self.step()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 853, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 2838, in _run_one_training_iteration
    results = self.training_step()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/dqn/dqn.py", line 447, in training_step
    train_results = train_one_step(self, train_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/execution/train_ops.py", line 70, in train_one_step
    info = local_worker.learn_on_batch(train_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 810, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 705, in learn_on_batch
    stats = self._learn_on_batch_helper(postprocessed_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 1095, in _learn_on_batch_helper
    grads_and_vars, _, stats = self._compute_gradients_helper(samples)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 1122, in _compute_gradients_helper
    losses = self.loss(self.model, self.dist_class, samples)
  File "/Pycharm/paper_2/tl_revised_ray/policy_left.py", line 122, in loss
    q_values = self.model.forward(obs, [], train_batch.SEQ_LENS)
  File "/Pycharm/paper_2/tl_revised_ray/left_model.py", line 103, in forward
    q_values = self.base_model.predict_on_batch(inputs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/keras/engine/training.py", line 2571, in predict_on_batch
    outputs = self.predict_function(iterator)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/tensorflow/python/ops/gradients_util.py", line 640, in _GradientsHelper
    "No gradient defined for operation"
LookupError: No gradient defined for operation'IteratorGetNext' (op type: IteratorGetNext). In general every operation must have an associated `@tf.RegisterGradient` for correct autodiff, which this op is lacking. If you want to pretend this operation is a constant in your program, you may insert `tf.stop_gradient`. This can be useful to silence the error in cases where you know gradients are not needed, e.g. the forward pass of tf.custom_gradient. Please see more details in https://www.tensorflow.org/api_docs/python/tf/custom_gradient.

Strangely, this problem does not occur when I call the model not from `loss()` but from, say, the `compute_actions()` method of my policy class.
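
For comparison, my `compute_actions()` override looks roughly like this (a simplified sketch; the greedy `argmax` stands in for my actual action-selection logic), and the very same `forward()` call raises no error there:

    import numpy as np
    from ray.rllib.policy.sample_batch import SampleBatch

    def compute_actions(self, obs_batch, state_batches=None, **kwargs):
        input_dict = {SampleBatch.CUR_OBS: obs_batch}
        # Same forward() call as in loss(), but no gradients are being traced
        # here, so predict_on_batch() works fine.
        q_values, _ = self.model.forward(input_dict, [], None)
        actions = np.argmax(q_values, axis=1)  # placeholder selection
        return actions, [], {}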

I have also tried replacing `self.base_model.predict_on_batch(inputs)` with `self.base_model.predict(inputs)` in my model’s `forward()` method. Again, when it is called from the policy’s `loss()` method, the following error occurs:

ERROR tune_controller.py:1502 -- Trial task failed for trial DQN_merge_env_ed17f_00000
Traceback (most recent call last):
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/_private/worker.py", line 2547, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::DQN.train() (pid=1838617, ip=198.18.0.1, actor_id=2a26de9e947b01699026a8ad01000000, repr=DQN)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 400, in train
    raise skipped from exception_cause(skipped)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/tune/trainable/trainable.py", line 397, in train
    result = self.step()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 853, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/algorithm.py", line 2838, in _run_one_training_iteration
    results = self.training_step()
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/algorithms/dqn/dqn.py", line 447, in training_step
    train_results = train_one_step(self, train_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/execution/train_ops.py", line 70, in train_one_step
    info = local_worker.learn_on_batch(train_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 810, in learn_on_batch
    info_out[pid] = policy.learn_on_batch(batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 705, in learn_on_batch
    stats = self._learn_on_batch_helper(postprocessed_batch)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 1095, in _learn_on_batch_helper
    grads_and_vars, _, stats = self._compute_gradients_helper(samples)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
    return func(self, *a, **k)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy_v2.py", line 1122, in _compute_gradients_helper
    losses = self.loss(self.model, self.dist_class, samples)
  File "/Pycharm/paper_2/tl_revised_ray/policy_left.py", line 122, in loss
    q_values = self.model.forward(obs, [], train_batch.SEQ_LENS)
  File "/Pycharm/paper_2/tl_revised_ray/left_model.py", line 103, in forward
    q_values = self.base_model.predict(inputs)
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/anaconda/anaconda3/envs/flow-pyg/lib/python3.7/site-packages/ray/rllib/policy/eager_tf_policy.py", line 124, in _disallow_var_creation
    "model initialization: {}".format(v.name)
ValueError: Detected a variable being created during an eager forward pass. Variables should only be created during model initialization: Variable:0
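
For what it’s worth, my current (unverified) guess is that Keras’s `predict_on_batch()` / `predict()` run outside the gradient tape that RLlib opens around `loss()`, so a differentiable forward pass would presumably have to call the Keras model directly, roughly like this (an untested sketch of `forward()`):

    def forward(self, input_dict, state, seq_lens):
        inputs = input_dict["obs"]  # plus the usual preprocessing
        # Calling the Keras model directly keeps the ops on the gradient tape,
        # unlike predict()/predict_on_batch(), which return plain numpy arrays.
        q_values = self.base_model(inputs, training=False)
        return q_values, state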

Those are my questions, and I would be grateful for any answers or pointers. :pray: