How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I am using Ray RLlib to train a multi-agent reinforcement learning model in Python, with a custom environment. The task is to train agents to fight against each other in two groups.
I am training my model in a curriculum-learning fashion, but I don't use the callback function to set a new task during training; I change the task manually (this is just my preference).
My agents train with a shared policy. The agents (opponents) from the other group are not trained but are hardcoded. However, in my current task level (level 3), I am doing a kind of self-play. Concretely, in level 3 both the agents and the opponents start with a copy of the policy obtained in level 2, but the agents' policy is updated during training, whereas the opponents' policy is frozen and only does inference. While training in this configuration, I sometimes get an error from PyTorch. I will list the most important code snippets below, followed by the error, which appears after around 6000 training iterations. I suspect it is caused by the rewards, which might be 0 for several episodes.
I don't know if this is the right way to set up self-play in Ray RLlib, but inside the environment I basically build the same set-up as for training and restore the algorithm from level 2; see the ENV code below. The BasicEnv class is just a blank class so that the algorithm can be set up inside the environment.
If more or more detailed code is needed, let me know! I am running Python 3.10.6, PyTorch 1.13.1 with CUDA 11.7, and Ray 2.0.0.
CODE:
from ray.rllib.algorithms import ppo
from ray.rllib.policy.policy import PolicySpec

config = {
    "multiagent": {
        "policies": {
            # Single policy shared by all trained agents.
            "shared_policy": PolicySpec(
                config={
                    "model": {
                        "fcnet_hiddens": [HL1, HL2],  # defined elsewhere
                        "fcnet_activation": "tanh",
                        "vf_share_layers": True,
                    }
                }
            )
        },
        "policy_mapping_fn": (
            lambda agent_id, episode, **kwargs: "shared_policy"
        ),
    },
    "train_batch_size": 2000,
    "rollout_fragment_length": 1000,
    "gamma": 0.99,
    "framework": "torch",
    "horizon": HORIZON,  # defined elsewhere
    "lambda": 0.8,
    "clip_param": 0.2,
    "lr": 1e-4,
    "num_workers": 4,
    "num_gpus": 1,
}

algo = ppo.PPO(env=Dogfight, config=config)
algo.restore(path)  # path to the level-2 checkpoint

for i in range(10001):
    result = algo.train()
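To catch the failure earlier, one thing I plan to try is guarding the training loop with a NaN check on the learner stats. This is just a sketch, assuming the Ray 2.0 result layout result["info"]["learner"][policy_id]["learner_stats"]; the checkpoint directory name is made up:

import math

for i in range(10001):
    result = algo.train()
    # Learner stats for the shared policy (Ray 2.0 result layout).
    stats = result["info"]["learner"]["shared_policy"]["learner_stats"]
    bad = {k: v for k, v in stats.items()
           if isinstance(v, float) and math.isnan(v)}
    if bad:
        # Save a checkpoint just before things blow up, then stop.
        algo.save("nan_debug_checkpoint")  # hypothetical directory
        raise RuntimeError(f"NaN in learner stats at iteration {i}: {bad}")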
ENV:
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class BasicEnv(MultiAgentEnv):
    """Blank env, only needed so the opponent algorithm can be built."""

    def __init__(self):
        super().__init__()

    def reset(self):
        pass

    def step(self, action_dict):
        pass

class Dogfight(MultiAgentEnv):
    def __init__(self, env_config):
        super().__init__()
        ...
        self.algo = self.setup_ss()

    def setup_ss(self):
        # Same set-up as for training, but with "explore": False so the
        # frozen level-2 opponent policy only does inference.
        ss_config = {
            "multiagent": {
                "policies": {
                    "shared_policy": PolicySpec(
                        config={
                            "model": {
                                "fcnet_hiddens": [512, 512],
                                "fcnet_activation": "tanh",
                                "vf_share_layers": True,
                            }
                        }
                    )
                },
                "policy_mapping_fn": (
                    lambda agent_id, episode, **kwargs: "shared_policy"
                ),
            },
            "train_batch_size": 2000,
            "rollout_fragment_length": 1000,
            "gamma": 0.99,
            "framework": "torch",
            "horizon": self.horizon,
            "lambda": 0.8,
            "clip_param": 0.2,
            "lr": 1e-4,
            "num_workers": 1,
            "explore": False,
        }
        algo = ppo.PPO(env=BasicEnv, config=ss_config)
        algo.restore(self.restore_path)
        return algo

    def step(self, actions):
        ...
        # Returns observation from the opponent's perspective.
        opponent_state = self.opponent_state()
        opponent_actions = self.algo.compute_single_action(
            observation=opponent_state, policy_id="shared_policy"
        )
        ...
        # *apply actions for agents and opponents*
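In case it matters: with several opponents, the query in step() would look roughly like the following sketch (opponent_ids and the per-agent argument to self.opponent_state() are hypothetical names, not from my actual code):

opponent_actions = {
    opp_id: self.algo.compute_single_action(
        observation=self.opponent_state(opp_id),
        policy_id="shared_policy",
        explore=False,  # redundant here, "explore": False is set globally
    )
    for opp_id in opponent_ids
}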
ERROR:
Traceback (most recent call last):
File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 847, in <module>
start_training(algo)
File "/home/ardianselmonaj/Projects/marl-warsim/train.py", line 809, in start_training
result = algo.train()
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 347, in train
result = self.step()
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 661, in step
results, train_iter_ctx = self._run_one_training_iteration()
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2378, in _run_one_training_iteration
num_recreated += self.try_recover_from_step_attempt(
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2190, in try_recover_from_step_attempt
raise error
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2373, in _run_one_training_iteration
results = self.training_step()
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 418, in training_step
train_results = train_one_step(self, train_batch)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/execution/train_ops.py", line 68, in train_one_step
info = do_minibatch_sgd(
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/sgd.py", line 129, in do_minibatch_sgd
local_worker.learn_on_batch(
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 914, in learn_on_batch
info_out[pid] = policy.learn_on_batch(batch)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 606, in learn_on_batch
grads, fetches = self.compute_gradients(postprocessed_batch)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/utils/threading.py", line 24, in wrapper
return func(self, *a, **k)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 789, in compute_gradients
tower_outputs = self._multi_gpu_parallel_grad_calc([postprocessed_batch])
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1179, in _multi_gpu_parallel_grad_calc
raise last_result[0] from last_result[1]
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=<SubBackward0>)
Traceback (most recent call last):
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/policy/torch_policy_v2.py", line 1095, in _worker
self.loss(model, self.dist_class, sample_batch)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo_torch_policy.py", line 87, in loss
curr_action_dist = dist_class(logits, model)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 103, in __init__
self.cats = [
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/ray/rllib/models/torch/torch_action_dist.py", line 104, in <listcomp>
torch.distributions.categorical.Categorical(logits=input_)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/categorical.py", line 66, in __init__
super(Categorical, self).__init__(batch_shape, validate_args=validate_args)
File "/home/ardianselmonaj/Projects/marl-warsim/venv/lib/python3.10/site-packages/torch/distributions/distribution.py", line 56, in __init__
raise ValueError(
ValueError: Expected parameter logits (Tensor of shape (128, 13)) of distribution Categorical(logits: torch.Size([128, 13])) to satisfy the constraint IndependentConstraint(Real(), 1), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], device='cuda:0',
grad_fn=<SubBackward0>)
In tower 0 on device cuda:0
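To rule out that the restored weights themselves already contain NaNs (rather than the logits only becoming NaN at loss time), a small check I could run on the policy network is the following sketch; it assumes the torch policy exposes its network as policy.model, as RLlib's torch policies do:

import torch

policy = algo.get_policy("shared_policy")
for name, param in policy.model.named_parameters():
    if torch.isnan(param).any():
        print(f"NaN values in parameter {name}")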