Possible checkpoint error while running Ray Tune

ppo_params = {
    "entropy_coeff": tune.loguniform(0.00000001, 0.1),
    "lr": tune.loguniform(5e-5, 1),
    "sgd_minibatch_size": tune.choice([32, 64, 128, 256, 512]),
    "lambda": tune.choice([0.1, 0.3, 0.5, 0.7, 0.9, 1.0]),
    "framework": "torch",

if __name__ == '__main__':
    tuner = tune.Tuner(
    # trainable_with_resources,
    run_config= RunConfig(
        name="Trial Run",
            # checkpoint_score_attribute="episode_reward_mean",
            # checkpoint_score_order='max',
    rs = tuner.fit()

In the code cell above, I am building a Ray Tune pipeline for financial reinforcement learning. When I run it, I get the following error:

TuneError                                 Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:853, in TrialRunner._wait_and_handle_event(self, next_trial)
    852 if event.type == _ExecutorEventType.TRAINING_RESULT:
--> 853     self._on_training_result(
    854         trial, result[_ExecutorEvent.KEY_FUTURE_RESULT]
    855     )
    856 else:

File ~/.local/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:978, in TrialRunner._on_training_result(self, trial, result)
    977 with warn_if_slow("process_trial_result"):
--> 978     self._process_trial_results(trial, result)

File ~/.local/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1061, in TrialRunner._process_trial_results(self, trial, results)
   1060 with warn_if_slow("process_trial_result"):
-> 1061     decision = self._process_trial_result(trial, result)
   1062 if decision is None:
   1063     # If we didn't get a decision, this means a
   1064     # non-training future (e.g. a save) was scheduled.
   1065     # We do not allow processing more results then.

File ~/.local/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:1100, in TrialRunner._process_trial_result(self, trial, result)
   1098 self._validate_result_metrics(flat_result)
-> 1100 if self._stopper(trial.trial_id, result) or trial.should_stop(flat_result):
   1101     decision = TrialScheduler.STOP
    251     experiment_checkpoint_dir = ray.get(
    252         self._remote_tuner.get_experiment_checkpoint_dir.remote()
    253     )

TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use `tuner = Tuner.restore("/home/athekunal/Ray for FinRL/trial_run/Trial Run")`.

Tuning starts, but the run fails after one sample. From the error trace, I suspect the checkpoint functionality is causing the error. Am I missing something here?

ray: 2.1.0
python: 3.10
WSL2 with Ubuntu 22.04
cuda: 11.6

Hi @Athe-kunal,

Is there more to the error message? That last TuneError should reference another error message above.

I believe this is an issue with the stopping key being used: Tune automatically populates the training_iteration metric, but your stopping metric name currently has an extra “s”. Let me know if that’s indeed the issue!
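
To illustrate the kind of typo described above, here is a minimal sketch that compares the keys of a stopping dict against the metrics a trial reports. It does not call Ray at all: `find_unknown_stop_keys` is a hypothetical helper, and `example_result` is an assumed, illustrative set of reported metrics, not output from the run in this thread.

```python
import difflib


def find_unknown_stop_keys(stop, result):
    """Return {bad_key: [closest_metric]} for stop keys missing from result."""
    unknown = {}
    for key in stop:
        if key not in result:
            # Suggest the closest metric name, if any is similar enough.
            unknown[key] = difflib.get_close_matches(key, list(result), n=1)
    return unknown


# Assumed example of metrics a trial might report (illustrative values only).
example_result = {
    "training_iteration": 1,
    "episode_reward_mean": 0.0,
    "timesteps_total": 4000,
}

# The misspelled key is flagged; the correct key passes.
print(find_unknown_stop_keys({"training_iterations": 100}, example_result))
print(find_unknown_stop_keys({"training_iteration": 100}, example_result))
```

Running a check like this against the first reported result before launching a long experiment would surface a misspelled stopping key early.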

Hi @justinvyu, thank you so much.
I can’t believe I was stuck on this. The last error trace said nothing about the extra “s” in training_iteration; it only said that the Ray Tune run failed due to some previous error. Error reporting in Ray Tune is pretty obscure. I also have a few other questions:

  1. What is the difference between training_iteration and the iterations in a loop such as
for i in range(iterations):

The docs say that training_iteration is the number of times session.report has been called. Can you please elaborate on this?

  2. If I set num_gpus > 1, do all the GPUs act as local workers, i.e. do they only optimize the policy, or do they also help collect environment rollouts? Is this inferred automatically by Ray, or do we need to specify something?

Thanks in advance

  1. session.report() is an API that Ray AIR provides for trainables to report metrics. In the context of RLlib, this means that we report once after each training iteration (whose length you can control with min_train_timesteps_per_iteration and similar settings) and evaluation.
  2. For example, num_gpus=2 only means that the process your algorithm runs in is assigned two GPUs. If you have a local_worker collecting samples, it can also use these GPUs. Generally, though, you will have remote RolloutWorkers, which you can give GPUs with num_gpus_per_worker.
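
To make point 1 concrete, here is a small pure-Python simulation of how a reported-iteration counter relates to a `for i in range(iterations)` loop. It does not use Ray: `run_trial` and `stop_at` are hypothetical names, and the counter only stands in for the training_iteration metric that Tune increments once per report.

```python
def run_trial(iterations, stop_at):
    """Simulate a trainable loop that reports once per pass."""
    training_iteration = 0  # stand-in for the per-trial counter Tune keeps
    for i in range(iterations):
        # ... one round of training would happen here ...
        training_iteration += 1  # one report -> counter goes up by one
        if training_iteration >= stop_at:
            break  # stand-in for the stopping criterion ending the trial
    return training_iteration


# The loop bound and the stopping criterion are independent limits:
# whichever is reached first ends the trial.
print(run_trial(iterations=100, stop_at=5))  # stopped by the criterion
print(run_trial(iterations=3, stop_at=5))    # loop ends before the criterion
```

In this sketch, `iterations` is just the upper bound of the user's own loop, while the counter tracks how many results have been reported so far.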

Yes, I understood
Thank you @arturn