Hey everyone,
I set up a small script to test a new system. The script just runs a simple PPOTrainer with Tune. However, when I set num_gpus in the config to any value other than 0.0, I get an error:
"ERROR worker.py:428 -- Exception raised in creation task: The actor died because of an error raised in its creation task".
The error seems to originate in the session init of my TensorFlow package.
Am I using num_gpus wrong? I'd be grateful if anyone could point me in the right direction!
Setup:
Ubuntu 20.04.3 LTS
Python 3.8.10
Intel Core i5-6600
GeForce RTX 2080 Ti
Packages (in python venv):
TensorFlow 2.6.0
Ray 1.7.0
apt packages:
cuda 11.2
cudnn 8.1
cudnn-dev 8.1
The script:
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.tune.logger import TBXLoggerCallback
config = {
    "env": "CartPole-v0",
    "evaluation_num_episodes": 1000,
    "num_gpus": 1.0,
}
stop = {"episode_reward_mean": 180}
tune.run(PPOTrainer, config=config, stop=stop, callbacks=[TBXLoggerCallback()])
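If it helps narrow things down, this is the standalone check I'd run next in the same venv, with no Ray involved, since it mirrors the session creation the traceback points at (happy to post its output):

```python
# Standalone sanity check, run in the same venv (outside of Ray):
# does plain TF see the GPU, and can a tf1.Session be created at all?
# RLlib's session_creator builds a tf1.Session the same way when num_gpus > 0.
import tensorflow as tf

tf1 = tf.compat.v1

gpus = tf.config.list_physical_devices("GPU")
print("TF version:", tf.__version__)
print("GPUs visible to TF:", gpus)

# If this raises the same std::bad_alloc, the problem is in the
# TF/CUDA setup itself rather than in Ray.
with tf1.Session() as sess:
    print("tf1.Session created, devices:", [d.name for d in sess.list_devices()])
```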
First status:
== Status ==
Memory usage on this node: 4.6/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 3.0/4 CPUs, 1.0/1 GPUs, 0.0/7.57 GiB heap, 0.0/3.78 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/tim/ray_results/PPO_2021-10-21_16-08-34
Number of trials: 1/1 (1 RUNNING)
+-----------------------------+----------+-------+
| Trial name | status | loc |
|-----------------------------+----------+-------|
| PPO_CartPole-v0_6112e_00000 | RUNNING | |
+-----------------------------+----------+-------+
2021-10-21 16:08:40,164 ERROR trial_runner.py:846 -- Trial PPO_CartPole-v0_6112e_00000: Error processing event.
Traceback (most recent call last):
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 812, in _process_trial
results = self.trial_executor.fetch_result(trial)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 767, in fetch_result
result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 89, in wrapper
return func(*args, **kwargs)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/worker.py", line 1623, in get
raise value
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=6830, ip=172.21.24.154)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 137, in __init__
Trainer.__init__(self, config, env, logger_creator)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 611, in __init__
super().__init__(config, logger_creator)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/tune/trainable.py", line 106, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in setup
super().setup(config)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 764, in setup
self._init(self.config, self.env_creator)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 171, in _init
self.workers = self._make_workers(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 846, in _make_workers
return WorkerSet(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 103, in __init__
self._local_worker = self._make_worker(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 399, in _make_worker
worker = cls(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 583, in __init__
self._build_policy_map(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1382, in _build_policy_map
self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 123, in create_policy
sess = self.session_creator()
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 316, in session_creator
return tf1.Session(
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1601, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 711, in __init__
self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
MemoryError: std::bad_alloc
Result for PPO_CartPole-v0_6112e_00000:
{}
Second status:
== Status ==
Memory usage on this node: 4.1/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/1 GPUs, 0.0/7.81 GiB heap, 0.0/3.91 GiB objects (0.0/1.0 accelerator_type:RTX)
Result logdir: /home/tim/ray_results/PPO_2021-10-21_15-36-01
Number of trials: 1/1 (1 ERROR)
+-----------------------------+----------+-------+
| Trial name | status | loc |
|-----------------------------+----------+-------|
| PPO_CartPole-v0_d4a36_00000 | ERROR | |
+-----------------------------+----------+-------+
Number of errored trials: 1
+-----------------------------+--------------+-----------------------------------------------------------------------------------------------------------+
| Trial name | # failures | error file |
|-----------------------------+--------------+-----------------------------------------------------------------------------------------------------------|
| PPO_CartPole-v0_d4a36_00000 | 1 | /home/tim/ray_results/PPO_2021-10-21_15-36-01/PPO_CartPole-v0_d4a36_00000_0_2021-10-21_15-36-01/error.txt |
+-----------------------------+--------------+-----------------------------------------------------------------------------------------------------------+
(pid=5284) 2021-10-21 15:36:06,522 ERROR worker.py:428 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=5284, ip=172.21.24.154)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 137, in __init__
(pid=5284) Trainer.__init__(self, config, env, logger_creator)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 611, in __init__
(pid=5284) super().__init__(config, logger_creator)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/tune/trainable.py", line 106, in __init__
(pid=5284) self.setup(copy.deepcopy(self.config))
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in setup
(pid=5284) super().setup(config)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 764, in setup
(pid=5284) self._init(self.config, self.env_creator)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 171, in _init
(pid=5284) self.workers = self._make_workers(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 846, in _make_workers
(pid=5284) return WorkerSet(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 103, in __init__
(pid=5284) self._local_worker = self._make_worker(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 399, in _make_worker
(pid=5284) worker = cls(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 583, in __init__
(pid=5284) self._build_policy_map(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1382, in _build_policy_map
(pid=5284) self.policy_map.create_policy(name, orig_cls, obs_space, act_space,
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/policy/policy_map.py", line 123, in create_policy
(pid=5284) sess = self.session_creator()
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 316, in session_creator
(pid=5284) return tf1.Session(
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 1601, in __init__
(pid=5284) super(Session, self).__init__(target, graph, config=config)
(pid=5284) File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/tensorflow/python/client/session.py", line 711, in __init__
(pid=5284) self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
(pid=5284) MemoryError: std::bad_alloc
Traceback (most recent call last):
File "ray_test.py", line 12, in <module>
tune.run(PPOTrainer, config=config, stop=stop, callbacks=[TBXLoggerCallback()]) # "log_level": "INFO" for verbose,
File "/home/tim/Documents/work/yog-sothoth/services/aiWorker/venv/lib/python3.8/site-packages/ray/tune/tune.py", line 611, in run
raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO_CartPole-v0_d4a36_00000])
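One more thought: since the crash happens inside the tf1.Session that RLlib creates for the policy, would overriding the session options via tf_session_args be the right knob here, e.g. to stop TF from grabbing all GPU memory up front? A sketch of what I mean (untested):

```python
# Untested sketch: pass TF session options through RLlib's config.
# "tf_session_args" is forwarded to the tf1.Session that RLlib creates
# for the policy; allow_growth makes TF allocate GPU memory on demand.
config = {
    "env": "CartPole-v0",
    "evaluation_num_episodes": 1000,
    "num_gpus": 1.0,
    "tf_session_args": {
        "gpu_options": {"allow_growth": True},
    },
}
```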