[AMP] Mixed precision training is slower than default precision

I can actually replicate this on an A100; you are correct that AMP takes more time than fp32. That said, I do not think this is related to Ray Train or Ray in general. This PyTorch issue should shed some light on the situation: torch.cuda.amp cannot speed up on A100 · Issue #57806 · pytorch/pytorch · GitHub. The gist is that on an A100, fp32 matmuls already run on TensorFloat-32 (TF32) Tensor Cores by default, which leaves AMP little room to improve throughput.

I have tried disabling tf32 support in the train function, and that slowed down non-AMP training considerably. Here are the times I got:

  • fp32 with tf32 (default): 106s
  • AMP with tf32: 112s
  • fp32 without tf32: 121s

Code to disable tf32:

from typing import Dict


def train_func(config: Dict):
    import torch.backends.cuda
    import torch.backends.cudnn

    # Force true fp32 math by disabling TF32 for matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

We may want to look into taking this into account in our AMP code, but for now the best way for you to move forward is to skip Ray Train’s AMP utilities and implement mixed precision yourself (or use a third-party library like Accelerate, for which we have added support in Ray nightly via ray.train.huggingface.accelerate.AccelerateTrainer). A sketch of the hand-rolled approach is below.
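
If you go the hand-rolled route, here is a minimal sketch of the standard torch.cuda.amp pattern (autocast plus GradScaler) inside a train function. The model, data, and hyperparameters are placeholders I made up for illustration, not something from the thread above:

from typing import Dict

import torch
import torch.nn as nn


def amp_train_func(config: Dict):
    device = torch.device("cuda")

    # Placeholder model and synthetic data, purely for illustration.
    model = nn.Sequential(
        nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))
    loss_fn = nn.CrossEntropyLoss()

    # GradScaler rescales the loss so fp16 gradients do not underflow.
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(config.get("steps", 100)):
        inputs = torch.randn(256, 1024, device=device)
        targets = torch.randint(0, 10, (256,), device=device)

        optimizer.zero_grad()
        # The forward pass runs in mixed precision under autocast.
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

You can pass a function like this as the train loop of a TorchTrainer in place of Ray Train’s built-in AMP helpers.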