Hi,
I am running Ray 2.1.0 on macOS (M1), using the standard PPO algorithm with a custom environment, and training hangs. The hang is intermittent, but it happens quite often.
Here is the code, which is fairly standard:
import ray
from ray.rllib.algorithms import ppo

# MyEnv is my custom environment class (defined elsewhere in the script).
init = ray.init(num_cpus=5, num_gpus=1, ignore_reinit_error=True, include_dashboard=True)
ppo_config = {
    "env": MyEnv,
    "env_config": {},
    "num_workers": 4,
    "num_gpus": 1,
    "ignore_worker_failures": True,
    "recreate_failed_workers": True,
    "framework": "tf2",
    "eager_tracing": True,
    "normalize_actions": False,
    "keep_per_episode_custom_metrics": True,
    "horizon": 1000,
    "batch_mode": "complete_episodes",
    "train_batch_size": 4000,
    "rollout_fragment_length": 1000,
    "observation_filter": "MeanStdFilter",
    "evaluation_num_workers": 1,
    "evaluation_interval": 1000,
    "evaluation_duration": 1,
    "evaluation_duration_unit": "episodes",
    "evaluation_config": {
        "render_env": False,
    },
    "log_level": "WARN",
}
algo = ppo.PPO(config=ppo_config)
TRAIN_ITER = 1000
for i in range(TRAIN_ITER):
    result = algo.train()
When it hangs and I interrupt the execution, I get this traceback:
Traceback (most recent call last):
  File "/Users/user/Desktop/PyCharm Projects/Project/RLlib-PPO.py", line 96, in <module>
    result=algo.train()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 352, in train
    result = self.step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 772, in step
    results, train_iter_ctx = self._run_one_training_iteration()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 2948, in _run_one_training_iteration
    results = self.training_step()
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 408, in training_step
    train_batch = synchronous_parallel_sample(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/rllib/execution/rollout_ops.py", line 100, in synchronous_parallel_sample
    sample_batches = ray.get(
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 2283, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/Users/user/miniforge3/envs/RLlib/lib/python3.10/site-packages/ray/_private/worker.py", line 668, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1445, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 190, in ray._raylet.check_status
KeyboardInterrupt
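The traceback shows the driver blocked in ray.get inside synchronous_parallel_sample, that is, waiting for the rollout workers to return samples. To record when a hang starts without interrupting by hand, one option is a watchdog around each training step. This is only a sketch using the standard-library faulthandler module; hang_watchdog, run_with_watchdog, and the 600-second default are hypothetical names and values, not part of Ray or RLlib, and the dump covers only the driver process, not the remote workers.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all thread stacks of this process to `file` if the wrapped
    block runs longer than `timeout_s` seconds (hypothetical helper)."""
    target = file if file is not None else sys.stderr
    # repeat=True keeps dumping every timeout_s seconds while still stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Always disarm the timer once the block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


def run_with_watchdog(step_fn, num_iters, timeout_s=600.0, file=None):
    """Call step_fn() num_iters times, each call guarded by the watchdog."""
    results = []
    for _ in range(num_iters):
        with hang_watchdog(timeout_s, file=file):
            results.append(step_fn())
    return results
```

With the script above, this would be called as run_with_watchdog(algo.train, TRAIN_ITER). The timeout has to be comfortably longer than a normal iteration, otherwise healthy steps also produce dumps.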
I do have access to the Ray Dashboard, but I am not sure where to look in it to diagnose this.
Thanks,
Antonio