How to troubleshoot a hang during a training rollout?

Hi,

I'm running Ray 2.1.0 on macOS (Apple M1), using the standard PPO algorithm with a custom environment, and training hangs. The issue is inconsistent, but it happens quite often.

Here is the code snippet; it's pretty standard:

import ray
from ray.rllib.algorithms import ppo

ray.init(num_cpus=5, num_gpus=1, ignore_reinit_error=True, include_dashboard=True)

ppo_config = {
    "env": MyEnv,
    "env_config": {},
    "num_workers": 4,
    "num_gpus": 1,
    "ignore_worker_failures": True,
    "recreate_failed_workers": True,
    "framework": "tf2",
    "eager_tracing": True,
    "normalize_actions": False,
    "keep_per_episode_custom_metrics": True,
    "horizon": 1000,
    "batch_mode": "complete_episodes",
    "train_batch_size": 4000,
    "rollout_fragment_length": 1000,
    "observation_filter": "MeanStdFilter",
    "evaluation_num_workers": 1,
    "evaluation_interval": 1000,
    "evaluation_duration": 1,
    "evaluation_duration_unit": "episodes",
    "evaluation_config": {
        "render_env": False,
    },
    "log_level": "WARN",
}

algo = ppo.PPO(config=ppo_config)

TRAIN_ITER = 1000
for i in range(TRAIN_ITER):
    result = algo.train()
When it happens, if I interrupt the execution (Ctrl-C) I get this traceback:

Traceback (most recent call last):
  File "/Users/user/Desktop/PyCharm Projects/Project/RLlib-PPO.py", line 96, in <module>
    result=algo.train()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 352, in train
    result = self.step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 772, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2948, in _run_one_training_iteration
    results = self.training_step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 408, in training_step
    train_batch = synchronous_parallel_sample(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/execution/rollout_ops.py", line 100, in synchronous_parallel_sample
    sample_batches = ray.get(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 2283, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 668, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1445, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 190, in ray._raylet.check_status
KeyboardInterrupt

I do have access to the Ray Dashboard, but I am not sure where to look.
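One generic way to capture where the driver is stuck when a hang happens, without waiting around to press Ctrl-C, is Python's built-in `faulthandler` watchdog. This is a minimal sketch, not Ray-specific: `train_once` below is a hypothetical stand-in for `algo.train()`, and the timeout value is an assumption you'd tune to your own iteration times.

```python
import faulthandler

# Hypothetical stand-in for algo.train(); substitute the real PPO object.
def train_once():
    return {"episode_reward_mean": 0.0}

TRAIN_ITER = 3
HANG_TIMEOUT_S = 600  # generous upper bound for one training iteration

results = []
for i in range(TRAIN_ITER):
    # If the iteration blocks longer than HANG_TIMEOUT_S, Python dumps the
    # stack of every thread to stderr, showing where the driver is waiting.
    faulthandler.dump_traceback_later(HANG_TIMEOUT_S, exit=False)
    results.append(train_once())
    faulthandler.cancel_dump_traceback_later()
```

With this in place, the stack dump appears automatically the first time an iteration stalls past the timeout, instead of only when the hang is interrupted by hand.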

Thanks,
Antonio

Just to help someone else in the same situation: I think the hang was caused by the tensorflow-metal 0.6.0 package. After uninstalling it I no longer see hangs, and the training is behaving very differently (more consistently with my expectations). Apparently the Metal acceleration was interfering with the entire process.
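For anyone who wants to confirm whether the plugin is still active in their environment before/after uninstalling, here is a quick sketch using only the standard library (the name checked is the pip package name):

```python
from importlib.metadata import version, PackageNotFoundError

def pip_version(name):
    """Return the installed version of a pip package, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None

# If this prints a version string, tensorflow-metal is still installed and
# TensorFlow will load the Metal plugin on import; remove it with:
#   pip uninstall tensorflow-metal
print("tensorflow-metal:", pip_version("tensorflow-metal"))
```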