"random_device could not be read"

This error occurs when I try to run Ray Tune inside a Docker container on an Ubuntu server, launching the script with nohup. Interestingly, the same code runs fine in a Jupyter Notebook, produces no output and no error when run as a plain .py script, and crashes with the error above when the .py script is run under nohup. Below are the full error output and my code snapshot.

2023-08-22 09:39:41.823427: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-08-22 09:39:41.823726: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-22 09:39:51.692399: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-08-22 09:40:16,058	INFO utils.py:593 -- Detected RAY_USE_MULTIPROCESSING_CPU_COUNT=1: Using multiprocessing.cpu_count() to detect the number of CPUs. This may be inconsistent when used inside docker. To correctly detect CPUs, unset the env var: `RAY_USE_MULTIPROCESSING_CPU_COUNT`.
2023-08-22 09:40:16,060	WARNING services.py:1816 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 66912256 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2023-08-22 09:40:18,130	INFO worker.py:1616 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266
terminate called after throwing an instance of 'std::runtime_error'
  what():  random_device could not be read
*** SIGABRT received at time=1692668430 on cpu 24 ***
PC: @     0x7f23f078da7c  (unknown)  pthread_kill
    @     0x7f22fb027d8d         64  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7f23f0739520  (unknown)  (unknown)
[2023-08-22 09:40:30,436 E 754795 754795] logging.cc:361: *** SIGABRT received at time=1692668430 on cpu 24 ***
[2023-08-22 09:40:30,436 E 754795 754795] logging.cc:361: PC: @     0x7f23f078da7c  (unknown)  pthread_kill
[2023-08-22 09:40:30,437 E 754795 754795] logging.cc:361:     @     0x7f22fb027db9         64  absl::lts_20220623::AbslFailureSignalHandler()
[2023-08-22 09:40:30,437 E 754795 754795] logging.cc:361:     @     0x7f23f0739520  (unknown)  (unknown)
Fatal Python error: Aborted

Stack (most recent call first):
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1173 in create_module
  File "<frozen importlib._bootstrap>", line 565 in module_from_spec
  File "<frozen importlib._bootstrap>", line 666 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/torch/__init__.py", line 229 in <module>
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 850 in exec_module
  File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/tensorboardX/writer.py", line 33 in <module>
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 850 in exec_module
  File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/tensorboardX/torchvis.py", line 11 in <module>
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 850 in exec_module
  File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/tensorboardX/__init__.py", line 5 in <module>
  File "<frozen importlib._bootstrap>", line 228 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 850 in exec_module
  File "<frozen importlib._bootstrap>", line 680 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 986 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1007 in _find_and_load
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/logger/tensorboardx.py", line 167 in __init__
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/utils/callback.py", line 139 in _create_default_callbacks
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/tune.py", line 785 in run
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 621 in _fit_internal
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 503 in fit
  File "/opt/conda/envs/ae/lib/python3.9/site-packages/ray/tune/tuner.py", line 367 in fit
  File "/root/wzc/ray&mlflow/results_ray.py", line 102 in tune_with_callback
  File "/root/wzc/ray&mlflow/results_ray.py", line 119 in <module>
import os
import tensorflow as tf
import ray
from ray import air, tune
from ray.air import session
from ray.air.integrations.mlflow import MLflowLoggerCallback
from ray.tune.search.basic_variant import BasicVariantGenerator
# os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
os.environ['RAY_OBJECT_STORE_ALLOW_SLOW_STORAGE'] = '1'
# os.environ['RAY_USE_MULTIPROCESSING_CPU_COUNT'] = '1'

import warnings
warnings.filterwarnings('ignore')

from keras.optimizers import Adam
from keras import backend as K
from keras.callbacks import EarlyStopping, Callback
from keras.regularizers import L1
from keras.models import Model, load_model
from keras.layers import Input, Dense, Dot, Reshape, BatchNormalization
# import keras.metrics

import mlflow

import pandas as pd
import numpy as np

from itertools import product
from functools import partial
from datetime import datetime
from time import time


def train_function(config, train_set=None, valid_set=None):
    # Fetch the datasets that the driver placed in the object store.
    train_data = ray.get(train_set)
    valid_data = ray.get(valid_set)

    # Minimal trainable: report a constant metric so Tune has something to log.
    session.report(
        metrics={
            "r2_pred": 1
        },
    )


def tune_with_callback(mlflow_tracking_uri):
    ray.init(ignore_reinit_error=True)
    # ray.init(num_cpus=20, num_gpus=0, object_store_memory=170*1000*1000*1000, ignore_reinit_error=True)

    # Share the datasets with all trials via the object store.
    train_set = ray.put(train_data)
    valid_set = ray.put(test_data)

    tuner = tune.Tuner(
        # tune.with_resources(partial(train_function, train_set=train_set, valid_set=valid_set), {"cpu": 3}),
        partial(train_function, train_set=train_set, valid_set=valid_set),

        param_space={
            "n_factors": 5,
            "n_layers": tune.grid_search([0, 1, 2, 3]),
        },

        tune_config=tune.TuneConfig(num_samples=1,
                                    search_alg=BasicVariantGenerator(max_concurrent=4),
                                    # search_alg=ConcurrencyLimiter(OptunaSearch(metric='val_loss', mode='min'), 10),
                                    # scheduler=ASHAScheduler(metric='ccc_loss', mode='min')
                                    ),

        run_config=air.RunConfig(
            name=f"ae_result_{datetime.now().strftime('%Y/%m/%d-%H:%M:%S')}",
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri=mlflow_tracking_uri,
                    experiment_name=f"ae_result_{datetime.now().strftime('%Y/%m/%d-%H:%M:%S')}",
                    save_artifact=True)
            ],
            failure_config=air.FailureConfig(max_failures=2)
        ),
    )

    results = tuner.fit()

    return results


if __name__ == '__main__':
    raw = np.random.randn(6, 100, 100)
    train_data = [raw[0], raw[1], raw[2]]
    test_data = [raw[3], raw[4], raw[5]]
    print(f'size of X1_train data: {train_data[0].shape}')

    mlflow_tracking_uri = "mlruns_test"
    result = tune_with_callback(mlflow_tracking_uri)
    ray.shutdown()

Hey what happens if you just try to run import torch? Looks like that is where the error is being raised.
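
For example, a minimal check could look like this (the file name and the launch command here are just an illustration, not part of your setup):

# check_torch_import.py -- hypothetical minimal repro: if this also aborts with
# "random_device could not be read" when run the same way (nohup, inside the
# container), the crash is in the torch import itself and not in Ray Tune.
import torch

print(torch.__version__)

Running it the same way you run your script (e.g. under nohup) and checking the output would show whether the abort reproduces outside of Tune.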

Still the same error. But it is kind of interesting that torch, tensorflow, and tensorboard all show up in the error output, since I use none of them in the trainable.

Ah by default Tune includes a Tensorboard logger… which can be disabled by setting the environment variable TUNE_DISABLE_AUTO_CALLBACK_LOGGERS=1

See docs here.
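
For example (a sketch: the variable just needs to be set before tuner.fit() creates the default callbacks, e.g. at the top of the script or exported in the shell before launching it with nohup):

import os

# Disable Tune's automatically added logger callbacks, so the TensorBoardX
# logger (whose import of torch is aborting here) is never constructed.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"

# ... the rest of results_ray.py (ray.init, tune.Tuner, tuner.fit, ...) stays unchanged.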