How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello! I am using Tune to run my deep learning experiments. I run several trials on one GPU, but sometimes a trial hits a CUDA OOM error, so I set max_failures=-1 so that the failed trial gets resumed later, once the other trials have finished.
However, I found that result.dataframe() fails with the following error:
Traceback (most recent call last):
  File "wave_1d1t_1.py", line 100, in <module>
    main(pde(train_distribution='determin', test_distribution='determin',
  File "wave_1d1t_1.py", line 68, in main
    best_loss_dataframes = result.dataframe(metric="unweighted_loss", mode="min")
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 134, in dataframe
    rows = self._retrieve_rows(metric=metric, mode=mode)
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 351, in _retrieve_rows
    idx = df[metric].idxmin()
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/pandas/core/frame.py", line 3458, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/grads/w/wangyc/anaconda3/envs/data_pinn/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 'unweighted_loss'
I inspected the results with tune.ExperimentAnalysis and printed all of the key names for each trial result. Only the trials that had been restarted once after an error returned a wrong key list. I think some key values are incorrectly being reported as key names. Here is an example:
I’m not sure whether this is my mistake or a bug, but it makes it hard to fully utilize the GPU: in my project the network has two training steps, and the second step usually takes more GPU memory, which means I have to carefully tune gpus_per_trial to avoid the CUDA OOM problem.
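For anyone who wants to reproduce the key-name check described above, here is a minimal sketch of that kind of per-trial inspection. It is not the exact code from my script: the helper name find_trials_missing_metric and the example column lists are hypothetical, and it assumes the per-trial column names have already been collected into a plain dict (for instance from ExperimentAnalysis.trial_dataframes, as noted in the comment).

```python
from typing import Dict, Iterable, List, Mapping


def find_trials_missing_metric(
    trial_columns: Mapping[str, Iterable[str]], metric: str
) -> Dict[str, List[str]]:
    """Return {trial_id: sorted column names} for trials whose results lack `metric`."""
    missing: Dict[str, List[str]] = {}
    for trial_id, columns in trial_columns.items():
        names = [str(c) for c in columns]
        if metric not in names:
            # Keep the unexpected column names so they can be inspected by hand.
            missing[trial_id] = sorted(names)
    return missing


# Hypothetical column lists, made up for illustration only. With Ray they
# could be built from ExperimentAnalysis.trial_dataframes, e.g.
#   {path: list(df.columns) for path, df in analysis.trial_dataframes.items()}
example = {
    "trial_ok": ["unweighted_loss", "training_iteration"],
    "trial_restarted": ["0.0123", "1"],  # stand-ins for values showing up as keys
}

print(find_trials_missing_metric(example, "unweighted_loss"))
```

Any trial reported by this helper is one whose result table has no column for the metric, which is the condition that makes df[metric].idxmin() raise the KeyError in the traceback.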
Could you please help me solve this problem? Thank you so much!