Hello, I'm wondering if someone can offer me some advice on this. I'm using a PyTorch Lightning module similar to the MNIST example, but I keep running into performance bottlenecks:
2021-01-20 12:44:50,088 WARNING util.py:150 -- The `callbacks.on_trial_result` operation took 111.168 s, which may be a performance bottleneck.
2021-01-20 12:44:50,090 WARNING util.py:150 -- The `process_trial_result` operation took 111.170 s, which may be a performance bottleneck.
2021-01-20 12:44:50,090 WARNING util.py:150 -- Processing trial results took 111.170 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2021-01-20 12:44:50,090 WARNING util.py:150 -- The `process_trial` operation took 111.170 s, which may be a performance bottleneck.
I saw speed-ups after trimming the config down to tunable hyperparameters only; I am using tune.with_parameters to pass in the rest.
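My setup is roughly like this (a minimal sketch; `train_fn` and the data placeholder stand in for my actual Lightning training function and dataset):

```python
from ray import tune

def train_fn(config, data=None):
    # `data` comes in through tune.with_parameters, so the large object is
    # stored once in the Ray object store rather than being serialized into
    # every trial's config dict.
    for step in range(10):
        loss = config["lr"] * step  # stand-in for a real training step
        tune.report(loss=loss)

# Only tunable hyperparameters live in the config now.
config = {"lr": tune.loguniform(1e-4, 1e-1)}

analysis = tune.run(
    tune.with_parameters(train_fn, data="large-dataset-placeholder"),
    config=config,
    num_samples=4,
)
```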
I also tried setting the reporting frequency to 500 s, but I'm still seeing the same warnings.
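For what it's worth, the only other knob I could think of on the Lightning side was reporting less often by running validation less often, something like this (assuming the standard TuneReportCallback integration; the numbers are arbitrary):

```python
import pytorch_lightning as pl
from ray.tune.integration.pytorch_lightning import TuneReportCallback

trainer = pl.Trainer(
    max_epochs=100,
    # Run the validation loop (and therefore the Tune report) only every
    # 5 epochs instead of every epoch.
    check_val_every_n_epoch=5,
    callbacks=[TuneReportCallback({"loss": "val_loss"}, on="validation_end")],
)
```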
My guess is that Tune is spending a lot of time checkpointing on every validation loop?
Would creating a callback that only saves a checkpoint when performance improves on a monitored metric be a viable solution?
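Concretely, I was thinking of something along these lines (a rough sketch assuming a lower-is-better metric like `val_loss`; the class name and the improvement check are my own additions on top of what the stock TuneReportCheckpointCallback does):

```python
import pytorch_lightning as pl
from ray import tune

class CheckpointOnImprovement(pl.Callback):
    """Report to Tune after every validation loop, but only write (and
    therefore sync) a checkpoint when the monitored metric improves."""

    def __init__(self, monitor="val_loss"):
        self.monitor = monitor
        self.best = float("inf")

    def on_validation_end(self, trainer, pl_module):
        if trainer.running_sanity_check:  # skip Lightning's sanity-check pass
            return
        value = trainer.callback_metrics[self.monitor]
        value = value.item() if hasattr(value, "item") else float(value)
        if value < self.best:
            self.best = value
            # Only pay the checkpoint (and worker-to-head sync) cost
            # when the metric actually improves.
            with tune.checkpoint_dir(step=trainer.current_epoch) as ckpt_dir:
                trainer.save_checkpoint(f"{ckpt_dir}/checkpoint.ckpt")
        tune.report(**{self.monitor: value})
```

I would then just pass it in via `Trainer(callbacks=[CheckpointOnImprovement()])` in place of the stock callback.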
In addition, because the checkpoints are so frequent and large, I suspect syncing them from the worker nodes to the head node takes a long time?
When I turn off syncing on checkpoint, things go awry and the experiment breaks at random points.
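For reference, by "turning off syncing on checkpoint" I mean this:

```python
from ray import tune

analysis = tune.run(
    train_fn,  # my trainable, as in the sketch above
    sync_config=tune.SyncConfig(sync_on_checkpoint=False),
)
```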