How to troubleshoot hang during a train rollout?

Hi,

Ray 2.1.0 on macOS (M1), running a standard PPO algorithm with a custom environment, hangs during training. The issue is intermittent, but it happens quite often.

Here is the code snippet; it is pretty standard:

import ray
from ray.rllib.algorithms import ppo

# MyEnv is my custom environment class, defined elsewhere.
ray.init(num_cpus=5, num_gpus=1, ignore_reinit_error=True, include_dashboard=True)

ppo_config = {
    "env": MyEnv,
    "env_config": {},
    "num_workers": 4,
    "num_gpus": 1,
    "ignore_worker_failures": True,
    "recreate_failed_workers": True,
    "framework": "tf2",
    "eager_tracing": True,
    "normalize_actions": False,
    "keep_per_episode_custom_metrics": True,
    "horizon": 1000,
    "batch_mode": "complete_episodes",
    "train_batch_size": 4000,
    "rollout_fragment_length": 1000,
    "observation_filter": "MeanStdFilter",
    "evaluation_num_workers": 1,
    "evaluation_interval": 1000,
    "evaluation_duration": 1,
    "evaluation_duration_unit": "episodes",
    "evaluation_config": {
        "render_env": False,
    },
    "log_level": "WARN",
}

algo = ppo.PPO(config=ppo_config)

TRAIN_ITER = 1000
for i in range(TRAIN_ITER):
    result = algo.train()
When the hang happens and I interrupt the execution (Ctrl-C), I get this traceback:

Traceback (most recent call last):
  File "/Users/user/Desktop/PyCharm Projects/Project/RLlib-PPO.py", line 96, in <module>
    result=algo.train()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 352, in train
    result = self.step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 772, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2948, in _run_one_training_iteration
    results = self.training_step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 408, in training_step
    train_batch = synchronous_parallel_sample(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/execution/rollout_ops.py", line 100, in synchronous_parallel_sample
    sample_batches = ray.get(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 2283, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 668, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1445, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 190, in ray._raylet.check_status
KeyboardInterrupt

I do have access to the Ray Dashboard, but I am not sure how to use it to figure out where the hang is.
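For reference, this is the kind of probe I was thinking of trying, to see whether one of the rollout workers is the one that is stuck (a minimal sketch only, assuming the Ray 2.1.0 WorkerSet API, i.e. algo.workers.remote_workers() and RolloutWorker.sample(); the 120-second timeout is arbitrary):

import ray

# Ask each remote rollout worker for one sample() directly and wait with a
# timeout, so a stuck worker shows up instead of blocking forever inside
# algo.train().
pending = [w.sample.remote() for w in algo.workers.remote_workers()]
ready, not_ready = ray.wait(pending, num_returns=len(pending), timeout=120.0)
print(f"{len(ready)} worker(s) returned, {len(not_ready)} still pending after 120 s")

I don't know if this is the recommended way to debug it, so any pointers (including where to look in the Dashboard) are welcome.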

Thanks,
Antonio

Just to help anyone else in the same situation: I think the hang was caused by the tensorflow-metal 0.6.0 package. After uninstalling it I no longer see hangs, and training actually behaves very differently (much more in line with my expectations). Apparently the Metal acceleration was interfering with the entire process.
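If anyone wants to check whether the Metal plugin is active in their environment, something like this should show it (a small sketch; tf.config.list_physical_devices is standard TensorFlow, and the GPU entry only appears while tensorflow-metal is installed):

import tensorflow as tf

# With tensorflow-metal installed, a "GPU" PhysicalDevice shows up here;
# after uninstalling the plugin, only the CPU device should remain.
print(tf.config.list_physical_devices())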
