How do I get tune.run to handle CUDA out of memory errors?

An explicit thing I’m trying to do with hyperparameter search is to learn what the limits are with my 8 GB of VRAM. I would like it so that if CUDA runs out of memory on a given run, this is simply treated as a high loss or some other “soft fail” that redirects the search algorithm somewhere else rather than causing the whole application to crash.
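For example, something along these lines is roughly what I have in mind (just a sketch; “train_model” stands in for my real training call, and the penalty loss value is arbitrary):

import torch
from ray import tune

def tunefunc(config):
    # Sketch: treat a CUDA OOM as a bad result rather than a crash.
    try:
        train_model(config)  # placeholder for the actual training routine
    except RuntimeError as e:
        if "out of memory" in str(e):
            torch.cuda.empty_cache()   # release whatever cached memory we can
            tune.report(loss=1e9)      # "soft fail": report a very high loss
        else:
            raise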

Hey @mm04926412, could you post the crash logs?

Usually, Tune shouldn’t fail overall if a single trial dies.

2020-12-09 19:17:35,914 INFO services.py:1090 -- View the Ray dashboard at http://127.0.0.1:8265
2020-12-09 19:17:38,131 WARNING function_runner.py:539 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2020-12-09 19:17:41,170 WARNING worker.py:1091 -- Warning: The actor ImplicitFunc has size 13186195 when pickled. It will be stored in Redis, which could cause memory issues. This may mean that its definition uses a large array or other object.
== Status ==
Memory usage on this node: 6.9/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 512.000: None | Iter 256.000: None | Iter 128.000: None | Iter 64.000: None | Iter 32.000: None | Iter 16.000: None | Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 8/8 CPUs, 1/1 GPUs, 0.0/34.33 GiB heap, 0.0/11.82 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/michael/ray_results/tunefunc_2020-12-09_19-17-38
Number of trials: 1/1 (1 RUNNING)
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+
| Trial name           | status   | loc   | af   |   attention_blocks |   attention_dim_per_head |   num_heads | post_pool_layers     | pre_pool_layers   |
|----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------|
| tunefunc_33b78_00000 | RUNNING  |       | mish |                 32 |                      128 |          16 | [512, 512, 512, 512] | [16]              |
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+


(pid=4075) {'pipeline_name': 'Agni_Elem_test', 'h5_file': 'Data/Structure_Cache_1000.hdf5', 'Features': [{'name': 'OPSiteFingerprint', 'Featurizer_PArgs': [], 'Featurizer_KArgs': {}}, {'name': 'AGNIFingerprints', 'Featurizer_PArgs': [], 'Featurizer_KArgs': {'etas': None, 'cutoff': 16}}], 'site_feature_size': 69, 'activation_function': 'mish', 'embedding_size': 64, 'attention_dim_per_head': 128, 'attention_blocks': 32, 'attention_heads': 8, 'pre_pool_layers': [16], 'post_pool_layers': [512, 512, 512, 512], 'Optimizer': {'Name': 'AdamW', 'Kwargs': {}}, 'Batch_Size': 32, 'Trainer kwargs': {'auto_lr_find': False, 'max_epochs': 1000}, 'alpha': 1, 'beta': 1, 'af': 'mish', 'num_heads': 16}
(pid=4075) Starting to init trainer!
(pid=4075) trainer is init now
(pid=4075) GPU available: True, used: True
(pid=4075) TPU available: False, using: 0 TPU cores
(pid=4075) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
(pid=4075) Using native 16bit precision.
(pid=4075) 
(pid=4075)   | Name    | Type                        | Params
(pid=4075) --------------------------------------------------------
(pid=4075) 0 | encoder | Crystal_Transformer_Encoder | 942 M 
(pid=4075) 1 | decoder | Infomax_Decoder             | 594 K 
Validation sanity check: 0it [00:00, ?it/s]
Validation sanity check:  50%|█████     | 1/2 [00:09<00:09,  9.90s/it]
Validation sanity check: 100%|██████████| 2/2 [00:10<00:00,  7.00s/it]
                                                                      
Epoch 0:   0%|          | 0/30 [00:00<?, ?it/s] 
(pid=4075) 2020-12-09 19:18:22,223      ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=4075) Traceback (most recent call last):
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=4075)     self._entrypoint()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=4075)     return self._trainable_func(self.config, self._status_reporter,
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 575, in _trainable_func
(pid=4075)     output = fn()
(pid=4075)   File "hparam_search.py", line 76, in tunefunc
(pid=4075)     train_model(config, Dataset)
(pid=4075)   File "hparam_search.py", line 61, in train_model
(pid=4075)     trainer.fit(model)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 446, in fit
(pid=4075)     results = self.accelerator_backend.train()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 64, in train
(pid=4075)     results = self.train_or_test()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
(pid=4075)     results = self.trainer.train()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
(pid=4075)     self.train_loop.run_training_epoch()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in run_training_epoch
(pid=4075)     batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 728, in run_training_batch
(pid=4075)     self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 469, in optimizer_step
(pid=4075)     self.trainer.accelerator_backend.optimizer_step(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 114, in optimizer_step
(pid=4075)     model_ref.optimizer_step(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1372, in optimizer_step
(pid=4075)     optimizer_closure()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 718, in train_step_and_backward_closure
(pid=4075)     result = self.training_step_and_backward(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 813, in training_step_and_backward
(pid=4075)     result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 320, in training_step
(pid=4075)     training_step_output = self.trainer.accelerator_backend.training_step(args)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 70, in training_step
(pid=4075)     output = self.__training_step(args)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 80, in __training_step
(pid=4075)     output = self.trainer.model.training_step(*args)
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/lightning_module.py", line 210, in training_step
(pid=4075)     Global_Embedding, std_log = self.encoder.forward(
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 209, in forward
(pid=4075)     x = block(x, Coulomb_Matrix, Distance_Matrix, Attention_Mask)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
(pid=4075)     result = self.forward(*input, **kwargs)
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 100, in forward
(pid=4075)     distance, self.af(self.distance_linear(x))
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
(pid=4075)     result = self.forward(*input, **kwargs)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
(pid=4075)     return F.linear(input, self.weight, self.bias)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear
(pid=4075)     output = input.matmul(weight.t())
(pid=4075) RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 6.44 GiB already allocated; 17.81 MiB free; 6.87 GiB reserved in total by PyTorch)
(pid=4075) Exception in thread Thread-2:
(pid=4075) Traceback (most recent call last):
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/threading.py", line 932, in _bootstrap_inner
(pid=4075)     self.run()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=4075)     raise e
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=4075)     self._entrypoint()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
(pid=4075)     return self._trainable_func(self.config, self._status_reporter,
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 575, in _trainable_func
(pid=4075)     output = fn()
(pid=4075)   File "hparam_search.py", line 76, in tunefunc
(pid=4075)     train_model(config, Dataset)
(pid=4075)   File "hparam_search.py", line 61, in train_model
(pid=4075)     trainer.fit(model)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 446, in fit
(pid=4075)     results = self.accelerator_backend.train()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 64, in train
(pid=4075)     results = self.train_or_test()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
(pid=4075)     results = self.trainer.train()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
(pid=4075)     self.train_loop.run_training_epoch()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in run_training_epoch
(pid=4075)     batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 728, in run_training_batch
(pid=4075)     self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 469, in optimizer_step
(pid=4075)     self.trainer.accelerator_backend.optimizer_step(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 114, in optimizer_step
(pid=4075)     model_ref.optimizer_step(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1372, in optimizer_step
(pid=4075)     optimizer_closure()
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 718, in train_step_and_backward_closure
(pid=4075)     result = self.training_step_and_backward(
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 813, in training_step_and_backward
(pid=4075)     result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 320, in training_step
(pid=4075)     training_step_output = self.trainer.accelerator_backend.training_step(args)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 70, in training_step
(pid=4075)     output = self.__training_step(args)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 80, in __training_step
(pid=4075)     output = self.trainer.model.training_step(*args)
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/lightning_module.py", line 210, in training_step
(pid=4075)     Global_Embedding, std_log = self.encoder.forward(
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 209, in forward
(pid=4075)     x = block(x, Coulomb_Matrix, Distance_Matrix, Attention_Mask)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
(pid=4075)     result = self.forward(*input, **kwargs)
(pid=4075)   File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 100, in forward
(pid=4075)     distance, self.af(self.distance_linear(x))
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
(pid=4075)     result = self.forward(*input, **kwargs)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
(pid=4075)     return F.linear(input, self.weight, self.bias)
(pid=4075)   File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear
(pid=4075)     output = input.matmul(weight.t())
(pid=4075) RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 6.44 GiB already allocated; 17.81 MiB free; 6.87 GiB reserved in total by PyTorch)
2020-12-09 19:18:24,133 ERROR trial_runner.py:793 -- Trial tunefunc_33b78_00000: Error processing event.
Traceback (most recent call last):
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/trial_runner.py", line 726, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train() (pid=4075, ip=192.168.0.14)
  File "python/ray/_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/trainable.py", line 336, in train
    result = self.step()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 512, in _report_thread_runner_error
    raise TuneError(("Trial raised an exception. Traceback:\n{}"
ray.tune.error.TuneError: Trial raised an exception. Traceback:
ray::ImplicitFunc.train() (pid=4075, ip=192.168.0.14)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 315, in entrypoint
    return self._trainable_func(self.config, self._status_reporter,
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/function_runner.py", line 575, in _trainable_func
    output = fn()
  File "hparam_search.py", line 76, in tunefunc
    train_model(config, Dataset)
  File "hparam_search.py", line 61, in train_model
    trainer.fit(model)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 446, in fit
    results = self.accelerator_backend.train()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 64, in train
    results = self.train_or_test()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 66, in train_or_test
    results = self.trainer.train()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 495, in train
    self.train_loop.run_training_epoch()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 561, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 728, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 469, in optimizer_step
    self.trainer.accelerator_backend.optimizer_step(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 114, in optimizer_step
    model_ref.optimizer_step(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1372, in optimizer_step
    optimizer_closure()
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 718, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 813, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 320, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 70, in training_step
    output = self.__training_step(args)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 80, in __training_step
    output = self.trainer.model.training_step(*args)
  File "/home/michael/Crystal_Transformer_InfoMax/lightning_module.py", line 210, in training_step
    Global_Embedding, std_log = self.encoder.forward(
  File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 209, in forward
    x = block(x, Coulomb_Matrix, Distance_Matrix, Attention_Mask)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/michael/Crystal_Transformer_InfoMax/modules.py", line 100, in forward
    distance, self.af(self.distance_linear(x))
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/torch/nn/functional.py", line 1692, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 7.79 GiB total capacity; 6.44 GiB already allocated; 17.81 MiB free; 6.87 GiB reserved in total by PyTorch)
== Status ==
Memory usage on this node: 10.8/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 512.000: None | Iter 256.000: None | Iter 128.000: None | Iter 64.000: None | Iter 32.000: None | Iter 16.000: None | Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/34.33 GiB heap, 0.0/11.82 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/michael/ray_results/tunefunc_2020-12-09_19-17-38
Number of trials: 1/1 (1 ERROR)
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+
| Trial name           | status   | loc   | af   |   attention_blocks |   attention_dim_per_head |   num_heads | post_pool_layers     | pre_pool_layers   |
|----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------|
| tunefunc_33b78_00000 | ERROR    |       | mish |                 32 |                      128 |          16 | [512, 512, 512, 512] | [16]              |
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+
Number of errored trials: 1
+----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name           |   # failures | error file                                                                                                                                                                                                              |
|----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| tunefunc_33b78_00000 |            1 | /home/michael/ray_results/tunefunc_2020-12-09_19-17-38/tunefunc_33b78_00000_0_af=mish,attention_blocks=32,attention_dim_per_head=128,num_heads=16,post_pool_layers=[512, 512, 512, 512],p_2020-12-09_19-17-39/error.txt |
+----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

== Status ==
Memory usage on this node: 10.8/62.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 512.000: None | Iter 256.000: None | Iter 128.000: None | Iter 64.000: None | Iter 32.000: None | Iter 16.000: None | Iter 8.000: None | Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/1 GPUs, 0.0/34.33 GiB heap, 0.0/11.82 GiB objects (0/1.0 accelerator_type:RTX)
Result logdir: /home/michael/ray_results/tunefunc_2020-12-09_19-17-38
Number of trials: 1/1 (1 ERROR)
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+
| Trial name           | status   | loc   | af   |   attention_blocks |   attention_dim_per_head |   num_heads | post_pool_layers     | pre_pool_layers   |
|----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------|
| tunefunc_33b78_00000 | ERROR    |       | mish |                 32 |                      128 |          16 | [512, 512, 512, 512] | [16]              |
+----------------------+----------+-------+------+--------------------+--------------------------+-------------+----------------------+-------------------+
Number of errored trials: 1
+----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Trial name           |   # failures | error file                                                                                                                                                                                                              |
|----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| tunefunc_33b78_00000 |            1 | /home/michael/ray_results/tunefunc_2020-12-09_19-17-38/tunefunc_33b78_00000_0_af=mish,attention_blocks=32,attention_dim_per_head=128,num_heads=16,post_pool_layers=[512, 512, 512, 512],p_2020-12-09_19-17-39/error.txt |
+----------------------+--------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

2020-12-09 19:18:24,160 ERROR tune.py:436 -- Trials did not complete: [tunefunc_33b78_00000]
2020-12-09 19:18:24,160 INFO tune.py:439 -- Total run time: 49.08 seconds (45.23 seconds for the tuning loop).
Traceback (most recent call last):
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'trial_id'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "hparam_search.py", line 121, in <module>
    train_infomax_asha(config, Dataset)
  File "hparam_search.py", line 86, in train_infomax_asha
    df = analysis.results_df
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 488, in results_df
    return pd.DataFrame.from_records(
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pandas/core/frame.py", line 1807, in from_records
    i = columns.get_loc(index)
  File "/home/michael/anaconda3/envs/pytorch-cuda-11/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'trial_id'

My apologies if I posted too much; I’m not sure how much of the output you wanted.

No problem!

I added backticks to format it as code. Can you also post your tune.run call?

I think I solved my problem. I just had to add the kwarg “max_failures=-1” to tune.run, and it appears the crash was caused by code after tune.run in my main.py.

from multiprocessing import cpu_count  # or os.cpu_count()

from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler


def train_infomax_asha(config, Dataset):
    # ASHA stops under-performing trials early, minimizing "loss".
    scheduler = ASHAScheduler(
        max_t=config["Trainer kwargs"]["max_epochs"],
        grace_period=1,
        reduction_factor=2,
        metric="loss",
        mode="min",
    )
    reporter = CLIReporter(metric_columns=["loss", "training_iteration"])
    resources_per_trial = {"cpu": cpu_count(), "gpu": 1}

    # Wrap train_model so the Dataset is captured in the trainable's closure.
    def tunefunc(config):
        train_model(config, Dataset)

    analysis = tune.run(
        tunefunc,
        resources_per_trial=resources_per_trial,
        progress_reporter=reporter,
        scheduler=scheduler,
        config=config,
        raise_on_failed_trial=False,  # don't raise when a trial errors
        max_failures=-1,              # retry a failed trial indefinitely
    )
    df = analysis.results_df
    df.to_csv("analysis_tune.csv")

This is the code now, if that’s what you meant? My intent really was to specifically catch an “out of memory” error and respond by reducing the hparam values until it’s working again.

Right; max_failures will retry the same parameters over and over.

Maybe consider passing in a larger search space through config?
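For example (a hypothetical search space using the hparam names from your trial table; the ranges are only illustrative):

from ray import tune

config = {
    "af": "mish",
    "attention_blocks": tune.choice([4, 8, 16, 32]),
    "attention_dim_per_head": tune.choice([32, 64, 128]),
    "num_heads": tune.choice([4, 8, 16]),
    "pre_pool_layers": tune.choice([[16], [64, 16]]),
    "post_pool_layers": tune.choice([[256, 256], [512, 512, 512, 512]]),
}

Trials that OOM would then just show up as errored while the others keep going.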

I’ve passed through a large search space with extreme ends for both the nodes per layer and the number of layers, on the assumption that a large NPL plus a large layer count would trigger a CUDA out-of-memory error.

If I run without failure tolerance, the whole hyperparameter search seems to end as soon as it encounters a failure, but with max_failures it will retry the same params over and over? How would I make it so the script simply “moves on”?

Hmm, it seems to me that there’s only 1 trial that’s being run (from the entire script). Can you post your config?

Do you need to set tune.run(num_samples=N)?
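e.g. (a sketch; the sample count is arbitrary):

analysis = tune.run(
    tunefunc,
    config=config,
    num_samples=50,  # how many hyperparameter configurations to try
    resources_per_trial=resources_per_trial,
    progress_reporter=reporter,
    scheduler=scheduler,
    raise_on_failed_trial=False,
)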

Oh I see, and this is 1 by default? I have changed this and got better results, but I’m slightly confused, as I thought there was an algorithm that decided when it had found optimal hyperparameters and made its own decision to stop?

You’ll have to configure Tune to do that with a Search Algorithm!
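For example, a sketch using HyperOptSearch (assuming the hyperopt package is installed and your config uses the tune.* sampling APIs):

from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

search_alg = HyperOptSearch(metric="loss", mode="min")

analysis = tune.run(
    tunefunc,
    config=config,          # tune.* search space, as above
    search_alg=search_alg,  # proposes new configs based on past results
    num_samples=50,
    resources_per_trial=resources_per_trial,
)

The scheduler (ASHA) still handles early stopping of bad trials; the search algorithm decides which configs to try next.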

Ah, I see! I erroneously thought the scheduler handled this. Thank you!
