Cannot correctly get the dataframe for trials that were restarted after an error

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello! I am using Ray Tune for my deep learning work. I run several trials on one GPU, but occasionally a trial hits a CUDA OOM error, so I set max_failures=-1 so that a failed trial gets resumed later, once the other trials have finished.
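For reference, here is a minimal sketch of how I launch the experiment; the trainable, config values, and paths are placeholders, and only max_failures=-1 reflects my actual setup:

```python
from ray import tune

def train_fn(config):
    # placeholder trainable; my real training loop reports "unweighted_loss"
    for step in range(10):
        tune.report(unweighted_loss=1.0 / (step + 1))

analysis = tune.run(
    train_fn,
    num_samples=4,                      # several trials share one GPU
    resources_per_trial={"gpu": 0.25},  # placeholder GPU fraction
    max_failures=-1,                    # retry failed trials (e.g. after CUDA OOM) indefinitely
    local_dir="~/ray_results",          # placeholder results directory
)
```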

However, I found that result.dataframe() fails with the following error:

```
Traceback (most recent call last):
  File "wave_1d1t_1.py", line 100, in <module>
    main(pde(train_distribution='determin', test_distribution='determin',
  File "wave_1d1t_1.py", line 68, in main
    best_loss_dataframes = result.dataframe(metric="unweighted_loss", mode="min")
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 134, in dataframe
    rows = self._retrieve_rows(metric=metric, mode=mode)
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 351, in _retrieve_rows
    idx = df[metric].idxmin()
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'unweighted_loss'
```

I inspected the results with tune.ExperimentAnalysis and printed the result keys for each trial. Only the trials that were restarted after an error return a wrong key list: some metric values appear to be used as the key names instead. The sketch below shows roughly how I check the columns for each trial.
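A minimal sketch of that check, with a placeholder path for my actual experiment directory:

```python
from ray.tune import ExperimentAnalysis

# "~/ray_results/my_experiment" is a placeholder for my actual experiment directory
analysis = ExperimentAnalysis("~/ray_results/my_experiment")

# print the column names reported by each trial; for the trials that were
# restarted after an error, the columns look like metric values, not names
for logdir, df in analysis.trial_dataframes.items():
    print(logdir, list(df.columns))
```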

I'm not sure if this is my mistake or a bug, but it makes it hard to fully utilize the GPU: in my project the network has two training steps, and the second step usually needs more GPU memory, so I have to size gpus_per_trial conservatively to avoid the CUDA OOM problem.
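For context, this is roughly how I size the GPU share per trial today; the fraction has to cover the memory-heavy second training step, so the GPU sits partly idle during the first one (the value 0.5 and the trainable are placeholders):

```python
from ray import tune

# gpus_per_trial must be large enough for the memory-heavy second training
# step, even though the first step would fit with a much smaller share.
gpus_per_trial = 0.5  # placeholder value

analysis = tune.run(
    train_fn,  # same placeholder trainable as in the first sketch
    resources_per_trial={"gpu": gpus_per_trial},
    max_failures=-1,
)
```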

Could you please help me solve this problem? Thank you so much!