Hello,
I seem to be having an issue with saving checkpoints and loading them back up.
Around the end of a trial I get multiple "No such file or directory" errors and the trial stops.
Here are the error logs:
Failure # 1 (occurred at 2024-06-26_01-17-18)
ray::ImplicitFunc.train() (pid=106336, ip=127.0.0.1, actor_id=66972fd4a7dc27ce3bfde81701000000, repr=train_stock_model)
File "python\ray\_raylet.pyx", line 1893, in ray._raylet.execute_task
File "python\ray\_raylet.pyx", line 1834, in ray._raylet.execute_task.function_executor
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\_private\function_manager.py", line 691, in actor_method_executor
return method(__ray_actor, *args, **kwargs)
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\tune\trainable\trainable.py", line 331, in train
raise skipped from exception_cause(skipped)
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\air\_internal\util.py", line 98, in run
self._ret = self._target(*self._args, **self._kwargs)
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\tune\trainable\function_trainable.py", line 174, in <lambda>
training_func=lambda: self._trainable_func(self.config),
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
return method(self, *_args, **_kwargs)
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\tune\trainable\function_trainable.py", line 248, in _trainable_func
output = fn()
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\tune\trainable\util.py", line 130, in inner
return trainable(config, **fn_kwargs)
File "c:\users\administrator\downloads\python\temp.py", line 84, in train_stock_model
model.summary()
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\keras\engine\training.py", line 3219, in summary
layer_utils.print_summary(
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\keras\utils\layer_utils.py", line 320, in print_summary
print_fn('Model: "{}"'.format(model.name))
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\keras\utils\io_utils.py", line 77, in print_msg
sys.stdout.write(message + "\n")
File "C:\ProgramData\anaconda3\envs\tf-gpu\lib\site-packages\ray\_private\utils.py", line 412, in write
self.stream.write(data)
FileNotFoundError: [Errno 2] No such file or directory
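Looking at the traceback, the write that actually fails is Ray's redirected sys.stdout inside model.summary(), not the checkpoint file itself. As a workaround I am considering routing the summary through a print_fn instead of the worker's stdout; this is just a rough sketch of what I mean (the logger name is made up):

import logging

logger = logging.getLogger("train_stock_model")  # placeholder logger name

# Send the summary lines to logging instead of the worker's redirected
# sys.stdout stream, which is what raises the FileNotFoundError above.
model.summary(print_fn=logger.info)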
Here are some other warnings I am getting as well:
2024-06-26 01:17:12,051 WARNING util.py:201 -- The `on_step_begin` operation took 4.172 s, which may be a performance bottleneck.
This is my current config for Ray:
ray.init(configure_logging=True, log_to_driver=True, num_gpus=4, ignore_reinit_error=True)  # logging_level=logging.DEBUG
algo = TuneBOHB()
scheduler = HyperBandForBOHB(
    time_attr="training_iteration",
    max_t=100,
    stop_last_trials=True,
)
X_train20, X_test20, y_train20, y_test20 = gendata_lstm.GenerateData.GenerateData(20, 20, None)
data = {
    "X_t20": X_train20,
    "X_tt20": X_test20,
    "y_t20": y_train20,
    "y_tt20": y_test20,
}
trainable_with_resources = tune.with_resources(
    train_stock_model, {"cpu": 3, "gpu": 0.1, "accelerator_type:A100": 0.025}
)
tuner = tune.Tuner(
    tune.with_parameters(trainable_with_resources, data=data),
    tune_config=tune.TuneConfig(
        metric="val_acc",
        mode="max",
        search_alg=algo,
        scheduler=scheduler,
        num_samples=3000,
        reuse_actors=False,
    ),
    run_config=train.RunConfig(
        name="1k_1_5k",
        storage_path="Z:\\Models",
    ),
    param_space={
        "seq_length": tune.choice([20]),
        "lr": tune.loguniform(0.0005, 0.005),
        "l1": tune.choice([1024, 1536]),
        "l2": tune.choice([1024, 1536]),
        "l3": tune.choice([1024, 1536]),
        "l4": tune.choice([1024, 1536]),
        "l1_dropout": tune.uniform(0.1, 0.2),
        "l2_dropout": tune.uniform(0.1, 0.2),
        "l3_dropout": tune.uniform(0.1, 0.2),
        "l4_dropout": tune.uniform(0.1, 0.2),
        "decay": tune.loguniform(0.00005, 0.001),
        "alpha": tune.loguniform(0.0005, 0.005),
        "batch_size": tune.choice([8]),
        "conv1d_filters": tune.choice([16]),
        "conv1d_kernel": tune.choice([3]),
        "num_conv_layers": tune.choice([2]),
        "max_pool_size": 1,
    },
)
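Not sure whether it is related, but I have also been thinking about adding a FailureConfig / CheckpointConfig to the RunConfig so that a trial that dies on one of these errors gets retried and fewer checkpoints pile up on Z:\. A sketch of what I have in mind, assuming I am reading the ray.train docs correctly:

from ray import train

run_config = train.RunConfig(
    name="1k_1_5k",
    storage_path="Z:\\Models",
    # Retry a crashed trial a few times instead of failing it outright.
    failure_config=train.FailureConfig(max_failures=3),
    # Keep only the most recent checkpoints to cut down disk traffic on Z:\.
    checkpoint_config=train.CheckpointConfig(num_to_keep=2),
)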
And here is the model.fit call where I am checkpointing:
model.fit(
    train_data, batch_size=config["batch_size"], epochs=100,
    validation_data=val_data, verbose=False,
    callbacks=[ReportCheckpointCallback(metrics={
        "mean_accuracy": "accuracy", "mean_loss": "loss",
        "val_loss": "val_loss", "val_acc": "val_accuracy"})],
)
I am not sure why this happens; when I check, the file is actually there. It almost seems like there is a race condition somewhere for a specific trial: before it finishes writing the checkpoint, it tries to load it again. If someone could help or point me to what might be happening, please let me know!
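To rule out a race in the callback itself, one thing I might try is reporting the checkpoint manually: write the model to a local temp directory first and only hand the finished directory to Ray. This is just a sketch based on my reading of train.report / Checkpoint.from_directory (ManualReportCallback is a made-up name, not something I have validated), which would replace the ReportCheckpointCallback above with callbacks=[ManualReportCallback()]:

import os
import tempfile

from ray import train
from tensorflow import keras


class ManualReportCallback(keras.callbacks.Callback):
    """Hypothetical stand-in for ReportCheckpointCallback: save the model
    locally first, then report the completed directory to Ray."""

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        with tempfile.TemporaryDirectory() as tmpdir:
            # The checkpoint is fully written to local disk before Ray sees it.
            self.model.save(os.path.join(tmpdir, "model.h5"))
            train.report(
                {
                    "mean_accuracy": logs.get("accuracy"),
                    "mean_loss": logs.get("loss"),
                    "val_loss": logs.get("val_loss"),
                    "val_acc": logs.get("val_accuracy"),
                },
                checkpoint=train.Checkpoint.from_directory(tmpdir),
            )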