I want to test different custom PPO models, and then run two PPO algorithms either in parallel or in sequence. How can I do this?
A minimal example is below, but it ends with an error. Calling a1.stop() or ray.shutdown() between the two runs does not help.
software versions:
ray and rllib: 2.0.0
python: 3.8
pytorch: 1.10.2
import ray
from ray.rllib.algorithms.ppo import PPO

config = {
    "framework": "torch",
    "env": "CartPole-v0",
    "gamma": 0.0,
}

ray.init(include_dashboard=False)

# First PPO instance: trains without problems.
a1 = PPO(config=config)
res1 = a1.train()
print("a1 res")

# Neither stopping the first algorithm nor shutting down Ray helps:
# a1.stop()
# ray.shutdown()

# Second PPO instance: fails during setup (traceback below).
a2 = PPO(config=config)
res2 = a2.train()
print("a2 res")
Error:
2022-10-10 17:25:05,752 WARNING worker.py:1829 -- The node with node id: 67b8ed4a54504877a25602a5aba40d4ad4ca1daff2fea7e9f7e6009b and address: 127.0.0.1 and node name: 127.0.0.1 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "E:\workspace-bupt\06-offload-sdag\0-code\test_rllib.py", line 16, in <module>
    a2 = PPO(config=config)
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 308, in __init__
    super().__init__(config=config, logger_creator=logger_creator, **kwargs)
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\tune\trainable\trainable.py", line 157, in __init__
    self.setup(copy.deepcopy(self.config))
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 446, in setup
    raise e
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\algorithms\algorithm.py", line 418, in setup
    self.workers = WorkerSet(
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 125, in __init__
    self.add_workers(
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 269, in add_workers
    self.foreach_worker(lambda w: w.assert_healthy())
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\rllib\evaluation\worker_set.py", line 391, in foreach_worker
    remote_results = ray.get([w.apply.remote(func) for w in self.remote_workers()])
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\_private\client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "E:\ProgramData\Anaconda3\envs\rllib\lib\site-packages\ray\_private\worker.py", line 2277, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
  class_name: RolloutWorker
  actor_id: 1519d88f54c329d1a94af12201000000
  pid: 15076
  namespace: fd8c3f60-1ea0-4852-80f4-473d2d605a4d
  ip: 127.0.0.1
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 127.0.0.1 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Process finished with exit code 1