[AMP] Mixed precision training is slower than default precision

I can actually replicate this on an A100; you are correct that AMP takes more time than fp32. That said, I do not think this is related to Ray Train or Ray in general. This PyTorch issue should shed some light on the situation: torch.cuda.amp cannot speed up on A100 · Issue #57806 · pytorch/pytorch · GitHub. The gist is that on an A100, fp32 matmuls already run on TensorFloat-32 (TF32) Tensor Cores by default, which leaves AMP little room to improve throughput.

I have tried disabling tf32 support in the train function, and that slowed down non-AMP training considerably. Here are the times I got:

  • fp32 with tf32 (default): 106s
  • AMP with tf32: 112s
  • fp32 without tf32: 121s

Code to disable tf32:

from typing import Dict


def train_func(config: Dict):
    import torch.backends.cuda
    import torch.backends.cudnn

    # Force true fp32 math by disabling TF32 for matmuls and cuDNN convolutions.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False

We may want to look into taking this into account in our AMP code, but for now the best way for you to move forward is to skip Ray Train’s AMP utilities and implement mixed precision yourself (or use a third-party library like Accelerate, for which we have added support in Ray nightly via ray.train.huggingface.accelerate.AccelerateTrainer). A sketch of the hand-rolled approach is below.
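
If you go the hand-rolled route, here is a minimal sketch of the standard torch.cuda.amp pattern (autocast plus GradScaler) inside a train function. The model, data, and hyperparameters are placeholders I made up for illustration, not something from the thread above:

from typing import Dict

import torch
import torch.nn as nn


def amp_train_func(config: Dict):
    device = torch.device("cuda")

    # Placeholder model and synthetic data, purely for illustration.
    model = nn.Sequential(
        nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)
    ).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))
    loss_fn = nn.CrossEntropyLoss()

    # GradScaler rescales the loss so fp16 gradients do not underflow.
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(config.get("steps", 100)):
        inputs = torch.randn(256, 1024, device=device)
        targets = torch.randint(0, 10, (256,), device=device)

        optimizer.zero_grad()
        # The forward pass runs in mixed precision under autocast.
        with torch.cuda.amp.autocast():
            loss = loss_fn(model(inputs), targets)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

You can pass a function like this as the train loop of a TorchTrainer in place of Ray Train’s built-in AMP helpers.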