BLAS Error on worker only when using TF2

Hi

Using PPO with this config:

import ray
from ray.rllib.algorithms.ppo import PPOConfig

ray.init(num_cpus=8, num_gpus=1, log_to_driver=True)
config = (
    PPOConfig()
    .resources(num_gpus=1, num_cpus_per_worker=0.1, num_gpus_per_worker=0)
    .framework(framework="tf2", eager_tracing=True)
)
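The crash happens as soon as the rollout workers are created, i.e. roughly at this point (build/train sketched here only for context; my custom env registration is omitted):

algo = config.build()   # RolloutWorker creation happens here and fails
result = algo.train()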

There are probably plenty of mistakes in those few lines; I'm just a beginner trying to figure out how these parameters work. But regardless of how many GPUs I set in the config (0 or 1), I can't get past this error. All workers die with:


ERROR worker.py:763 -- Exception raised in creation task: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=12264, ip=127.0.0.1, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x000001F5D9225310>)
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 859, in ray._raylet.execute_task
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 863, in ray._raylet.execute_task
(RolloutWorker pid=12264)   File "python\ray\_raylet.pyx", line 810, in ray._raylet.execute_task.function_executor
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\_private\function_manager.py", line 674, in actor_method_executor
(RolloutWorker pid=12264)     return method(__ray_actor, *args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264)     return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 625, in __init__
(RolloutWorker pid=12264)     self._build_policy_map(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\util\tracing\tracing_helper.py", line 466, in _resume_span
(RolloutWorker pid=12264)     return method(self, *_args, **_kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\evaluation\rollout_worker.py", line 1899, in _build_policy_map
(RolloutWorker pid=12264)     self.policy_map.create_policy(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy_map.py", line 134, in create_policy
(RolloutWorker pid=12264)     policy = create_policy_for_framework(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\policy.py", line 113, in create_policy_for_framework
(RolloutWorker pid=12264)     return policy_class(observation_space, action_space, merged_config)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 154, in __init__
(RolloutWorker pid=12264)     super(TracedEagerPolicy, self).__init__(*args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\algorithms\ppo\ppo_tf_policy.py", line 102, in __init__
(RolloutWorker pid=12264)     self.maybe_initialize_optimizer_and_loss()
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 428, in maybe_initialize_optimizer_and_loss
(RolloutWorker pid=12264)     self._initialize_loss_from_dummy_batch(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\policy.py", line 1248, in _initialize_loss_from_dummy_batch
(RolloutWorker pid=12264)     actions, state_outs, extra_outs = self.compute_actions_from_input_dict(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 129, in _func
(RolloutWorker pid=12264)     return obj(self_, *args, **kwargs)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy.py", line 181, in compute_actions_from_input_dict
(RolloutWorker pid=12264)     return super(TracedEagerPolicy, self).compute_actions_from_input_dict(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 465, in compute_actions_from_input_dict
(RolloutWorker pid=12264)     ret = self._compute_actions_helper(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\utils\threading.py", line 24, in wrapper
(RolloutWorker pid=12264)     return func(self, *a, **k)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\policy\eager_tf_policy_v2.py", line 812, in _compute_actions_helper
(RolloutWorker pid=12264)     dist_inputs, state_out = self.model(
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\complex_input_net.py", line 182, in forward
(RolloutWorker pid=12264)     nn_out, _ = self.flatten[i](
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\modelv2.py", line 259, in __call__
(RolloutWorker pid=12264)     res = self.forward(restored, state or [], seq_lens)
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\ray\rllib\models\tf\fcnet.py", line 148, in forward
(RolloutWorker pid=12264)     model_out, self._value_out = self.base_model(input_dict["obs_flat"])
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
(RolloutWorker pid=12264)     raise e.with_traceback(filtered_tb) from None
(RolloutWorker pid=12264)   File "C:\ProgramData\Anaconda3\envs\rlenv\lib\site-packages\tensorflow\python\framework\ops.py", line 7209, in raise_from_not_ok_status
(RolloutWorker pid=12264)     raise core._status_to_exception(e) from None  # pylint: disable=protected-access
(RolloutWorker pid=12264) tensorflow.python.framework.errors_impl.InternalError: Exception encountered when calling layer "fc_value_1" "                 f"(type Dense).
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) {{function_node __wrapped__MatMul_device_/job:localhost/replica:0/task:0/device:GPU:0}} Attempting to perform BLAS operation using StreamExecutor without BLAS support [Op:MatMul]
(RolloutWorker pid=12264)
(RolloutWorker pid=12264) Call arguments received by layer "fc_value_1" "                 f"(type Dense):
(RolloutWorker pid=12264)   • inputs=tf.Tensor(shape=(32, 384), dtype=float32)

However, if I switch to TensorFlow 1 mode:

.framework(framework="tf")

then it works like a charm and the algorithm trains with my custom gym env.
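To be explicit, that is the exact same config with only the framework line swapped (sketch of the working variant):

config = (
    PPOConfig()
    .resources(num_gpus=1, num_cpus_per_worker=0.1, num_gpus_per_worker=0)
    .framework(framework="tf")  # static-graph TF1 path instead of tf2 + eager tracing
)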

I found no related issues or questions in the Ray community, and the only reference I found by googling (tensorflow - "Attempting to perform BLAS operation using StreamExecutor without BLAS support" error occurs - Stack Overflow) is of no use.

Any idea what the issue could be?

I upgraded my TensorFlow version (duh…) and that fixed it.
Now I only have to deal with a new bunch of deprecation warnings, but at least it's working.
Thanks


Actually, my versions were all messed up. I reinstalled the conda environment with current versions of several packages and the error came back. For reference, in case anyone hits this error in the future: I finally found someone with a similar error on the NVIDIA forums, and as it turns out, my NVIDIA driver was outdated. After updating it, the error went away for good.
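In case it helps anyone debugging the same thing, here is a quick sanity check (plain TensorFlow, nothing RLlib-specific) that failed with the same BLAS/StreamExecutor error while my setup was broken and passes after the driver update:

import tensorflow as tf

# Check that TF sees the GPU and was built with CUDA support.
print(tf.config.list_physical_devices("GPU"))
print(tf.test.is_built_with_cuda())

# A tiny matmul forced onto the GPU exercises the same code path
# (MatMul via cuBLAS) that the RLlib policy hit in the traceback above.
with tf.device("/GPU:0"):
    a = tf.random.normal((32, 384))
    b = tf.random.normal((384, 256))
    print(tf.matmul(a, b).shape)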
Thanks