BLAS Error on worker only when using TF2

Hi

Using PPO with this config:

ray.init(num_cpus=8, num_gpus=1, log_to_driver=True)

config = (
    PPOConfig()
    .resources(num_gpus=1, num_cpus_per_worker=0.1, num_gpus_per_worker=0)
    .framework(framework="tf2", eager_tracing=True)
)

There are probably many mistakes in those few lines; I'm just a beginner trying to figure out how these parameters work. Nonetheless, I can't get past this error no matter how many GPUs I set in the config (0 or 1). All workers die with this error:


ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=12264, ip=127.0.0.1, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000001F5D9225310>)
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 859, in ray._raylet.execute_task
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 863, in ray._raylet.execute_task
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
(RolloutWorker pid=12264)     return method(__ray_actor, *args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264)     return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 625, in __init__
(RolloutWorker pid=12264)     self._build_policy_map(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264)     return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1899, in _build_policy_map
(RolloutWorker pid=12264)     self.policy_map.create_policy(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy_map.py", line 134, in create_policy
(RolloutWorker pid=12264)     policy = create_policy_for_framework(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\policy.py", line 113, in create_policy_for_framework
(RolloutWorker pid=12264)     return policy_class(observation_space, action_space, merged_config)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 154, in __init__
(RolloutWorker pid=12264)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\algorithms\ppo\ppo_tf_policy.py", line 102, in __init__
(RolloutWorker pid=12264)     self.maybe_initialize_optimizer_and_loss()
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 428, in maybe_initialize_optimizer_and_loss
(RolloutWorker pid=12264)     self._initialize_loss_from_dummy_batch(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy.py", line 1248, in _initialize_loss_from_dummy_batch
(RolloutWorker pid=12264)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 129, in _func
(RolloutWorker pid=12264)     return obj(self_, *args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 181, in compute_actions_from_input_dict
(RolloutWorker pid=12264)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 465, in compute_actions_from_input_dict
(RolloutWorker pid=12264)     ret = self._compute_actions_helper(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(RolloutWorker pid=12264)     return func(self, *a, **k)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 812, in _compute_actions_helper
(RolloutWorker pid=12264)     dist_inputs, state_out = self.model(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\complex_input_net.py", line 182, in forward
(RolloutWorker pid=12264)     nn_out, _ = self.flatten[i](
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\fcnet.py", line 148, in forward
(RolloutWorker pid=12264)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
(RolloutWorker pid=12264)     raise e.with_traceback(filtered_tb) from None
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
(RolloutWorker pid=12264)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(RolloutWorker pid=12264) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" "                 f"(type Dense).
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) {{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) Call arguments received by layer "fc_value_1" "                 f"(type Dense):
(RolloutWorker pid=12264)   • inputs=tf.Tensor(shape=(32, 384), dtype=float32)

However, if I change to TensorFlow 1:

.framework(framework="tf")

then it works like a charm and my algorithm trains with my custom gym env.
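
For reference, here is roughly what the full script looks like (a minimal sketch: "CartPole-v1" stands in for my custom gym env and the short training loop is only illustrative):

import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init(num_cpus=8, num_gpus=1, log_to_driver=True)

config = (
    PPOConfig()
    .environment("CartPole-v1")  # placeholder for my custom gym env
    .resources(num_gpus=1, num_cpus_per_worker=0.1, num_gpus_per_worker=0)
    .framework(framework="tf")  # switching this to "tf2" with eager_tracing=True triggers the BLAS error
)

algo = config.build()
for _ in range(5):
    result = algo.train()
    print(result["episode_reward_mean"])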

I found no related issues or questions in the Ray community, and the only reference I found by googling (tensorflow - "Attempting to perform BLAS operation using StreamExecutor without BLAS support" error occurs - Stack Overflow) is of no use.

Any idea of what could be the issue?

I upgraded my TensorFlow version (duh…) and that fixed it.
Now I only have to deal with a bunch of new deprecation warnings, but at least it's working.
Thanks


Actually, my versions were all messed up. I reinstalled the conda environment with current versions of several packages and the error came back. For reference, in case anyone runs into this error in the future: I finally found someone with a similar error on the NVIDIA forums, and as it turns out, my NVIDIA driver was outdated. After updating it, the error went away for good.
Thanks

Hi Prejan,
as I am also using NVIDIA and facing exactly the same error message with rllib==2.3 and tensorflow==2.10.1, could you tell me which NVIDIA driver version worked for you?
Thanks!

In general, regarding the original error message, another possible cause is that the RLlib config is not set up properly in terms of num_gpus and num_gpus_per_worker.
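
A quick way to rule that out is to force everything onto the CPU first. A minimal sketch, assuming the rest of the config stays as in the original post:

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .framework(framework="tf2", eager_tracing=True)
    # Request no GPU for the learner and none for the rollout workers;
    # Ray should then hide the GPU from those processes via CUDA_VISIBLE_DEVICES.
    .resources(num_gpus=0, num_cpus_per_worker=1, num_gpus_per_worker=0)
)

If training works on CPU only but fails as soon as a GPU is requested, the problem is most likely in the CUDA/driver stack rather than in the RLlib config.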

Hello @PhilippWillms

Very sorry I didn't answer your question. I've been caught up with too many things and didn't come back to the forum for a while.

By now you have probably sorted this out already, but just in case you haven't, or anyone else needs this answer:

NVIDIA GeForce GTX 1050
Driver version: 528.02
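
In case it is useful, here is a quick sanity check (plain TensorFlow, nothing RLlib-specific) to confirm that the GPU is visible and that cuBLAS initializes after a driver update:

import tensorflow as tf

# Should list at least one 'GPU' device if the driver/CUDA stack is healthy.
print(tf.config.list_physical_devices("GPU"))

# A tiny MatMul on the GPU forces cuBLAS to initialize; this is the same kind of
# op that failed in the traceback above (inputs of shape (32, 384)).
with tf.device("/GPU:0"):
    a = tf.random.normal((32, 384))
    b = tf.random.normal((384, 256))
    print(tf.matmul(a, b).shape)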

Hope this helps