Hi
I'm using PPO with this config:
import ray
from ray.rllib.algorithms.ppo import PPOConfig
ray.init(num_cpus=8, num_gpus=1, log_to_driver=True)
config = (
    PPOConfig()
    .resources(num_gpus=1, num_cpus_per_worker=0.1, num_gpus_per_worker=0)
    .framework(framework="tf2", eager_tracing=True)
)
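For reference, here is a rough sketch of how the resource settings in this config add up into what RLlib asks Ray to reserve. It assumes the usual accounting (one bundle for the driver plus one per rollout worker). The helper name `requested_resources`, the `num_cpus_for_driver=1` default, and the worker count of 2 are illustrative assumptions, since the post does not show how many workers are used.

```python
def requested_resources(num_workers, num_gpus=1, num_cpus_per_worker=0.1,
                        num_gpus_per_worker=0, num_cpus_for_driver=1):
    """Estimate the total CPUs/GPUs reserved (hypothetical helper, not an RLlib API)."""
    # Driver (local worker) bundle plus one bundle per rollout worker.
    cpus = num_cpus_for_driver + num_workers * num_cpus_per_worker
    gpus = num_gpus + num_workers * num_gpus_per_worker
    return {"CPU": cpus, "GPU": gpus}

# num_workers=2 is a placeholder; the real value is not given in the post.
total = requested_resources(num_workers=2)
print(total)
```

These numbers are scheduling reservations only. The traceback below shows a RolloutWorker running its MatMul on GPU:0 even with num_gpus_per_worker=0, so this arithmetic by itself does not explain where TensorFlow places the ops.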
There are probably many mistakes in that config; I'm just a beginner trying to figure out how those parameters work. Nonetheless, I can't get past this error regardless of how many GPUs I set in those config parameters (0 or 1). All workers die with this error:
ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=12264, ip=127.0.0.1, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000001F5D9225310>)
(RolloutWorker pid=12264) File "python\ray\_raylet.pyx", line 859, in ray._raylet.execute_task
(RolloutWorker pid=12264) File "python\ray\_raylet.pyx", line 863, in ray._raylet.execute_task
(RolloutWorker pid=12264) File "python\ray\_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
(RolloutWorker pid=12264) return method(__ray_actor, *args, **kwargs)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264) return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 625, in __init__
(RolloutWorker pid=12264) self._build_policy_map(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264) return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1899, in _build_policy_map
(RolloutWorker pid=12264) self.policy_map.create_policy(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy_map.py", line 134, in create_policy
(RolloutWorker pid=12264) policy = create_policy_for_framework(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\policy.py", line 113, in create_policy_for_framework
(RolloutWorker pid=12264) return policy_class(observation_space, action_space, merged_config)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 154, in __init__
(RolloutWorker pid=12264) super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\algorithms\ppo\ppo_tf_policy.py", line 102, in __init__
(RolloutWorker pid=12264) self.maybe_initialize_optimizer_and_loss()
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 428, in maybe_initialize_optimizer_and_loss
(RolloutWorker pid=12264) self._initialize_loss_from_dummy_batch(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy.py", line 1248, in _initialize_loss_from_dummy_batch
(RolloutWorker pid=12264) actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 129, in _func
(RolloutWorker pid=12264) return obj(self_, *args, **kwargs)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 181, in compute_actions_from_input_dict
(RolloutWorker pid=12264) return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 465, in compute_actions_from_input_dict
(RolloutWorker pid=12264) ret = self._compute_actions_helper(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(RolloutWorker pid=12264) return func(self, *a, **k)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 812, in _compute_actions_helper
(RolloutWorker pid=12264) dist_inputs, state_out = self.model(
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264) res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\complex_input_net.py", line 182, in forward
(RolloutWorker pid=12264) nn_out, _ = self.flatten[i](
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264) res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\fcnet.py", line 148, in forward
(RolloutWorker pid=12264) model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
(RolloutWorker pid=12264) raise e.with_traceback(filtered_tb) from None
(RolloutWorker pid=12264) File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
(RolloutWorker pid=12264) raise core._status_to_exception(e) from None # pylint: disable=protected-access
(RolloutWorker pid=12264) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" " f"(type Dense).
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) {{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) Call arguments received by layer "fc_value_1" " f"(type Dense):
(RolloutWorker pid=12264)   • inputs=tf.Tensor(shape=(32, 384), dtype=float32)
However, if I switch to TensorFlow 1 (static graph) mode:
.framework(framework="tf")
then it works like a charm and my algorithm trains with my custom gym env.
I found no related issues or questions within the Ray community, and the only reference I found by googling (Stack Overflow: tensorflow - "Attempting to perform BLAS operation using StreamExecutor without BLAS support" error occurs) was of no use.
Any idea what the issue could be?