Hi Team,
I recently started using Ray to scale up and speed up the training of a deep learning model. For this I switched to TorchTrainer and made all the required code changes, and I was able to train the model much faster. But when I was playing around with num_workers in ScalingConfig, I got a different model (loss, predictions, scores, etc.) every time I ran with a different num_workers, even though none of the other params/configs were changed. The model is also different from the one produced by sequential training. Am I missing something here?
Please help.
What is the (global) batch size? Depending on how you write your code, it may also scale with num_workers in ScalingConfig, which can result in different models.
Hi,
I didn't define a global batch size. I don't understand why different models are generated just by changing num_workers. In fact, if I run the training many times with the same configuration I get a different model every time. I am running the code on a single machine with a 64-core CPU. Here is my trainer:
TorchTrainer(
    train_loop_per_worker=usad.train_loop_per_worker,
    train_loop_config=config_dict,
    scaling_config=ScalingConfig(num_workers=10, resources_per_worker={"CPU": 1, "GPU": 0}),
    run_config=RunConfig(local_dir=f"{ray_processed_folder}/{dataset}"),
)
Is there a reason why the trainer gives a different model every time? Can we fix that?
What is your train_loop_per_worker function?
Hi,
train_loop_per_worker is my training function. The code looks something like this:
import torch
import ray.train.torch
from torch.nn.parallel import DistributedDataParallel

# data, device, generate_model, to_device, train_batch and evaluate
# come from the surrounding class/module.
def train_loop_per_worker(self, config: dict):
    # Note: batch_size here is the per-worker batch size.
    train_loader = torch.utils.data.DataLoader(
        data,
        batch_size=config['batch_size']
    )
    model = generate_model()
    train_data_loader = ray.train.torch.prepare_data_loader(train_loader)
    # Unwrap the model if it was wrapped in DistributedDataParallel.
    if isinstance(model, DistributedDataParallel):
        model = model.module
    loss_history = []
    for epoch in range(config['num_epochs']):
        for index, [batch] in enumerate(train_data_loader):
            batch = to_device(batch, device)
            train_batch(batch)
        result = evaluate(train_data_loader)
        loss_history.append(result)
    model.save_model(config['path'])
Hi, any idea why this is happening?
I think what is happening here is that you are using a fixed per-worker batch size (config['batch_size']) in your training loop function.
Now, if you use a different number of workers, you are essentially training with a different global batch size: global_batch_size = config["batch_size"] * num_workers.
That will for sure impact model training.
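If you want the effective (global) batch size to stay the same no matter how many workers you use, one option is to put a global batch size in the config and divide it by the number of workers inside the training loop. A rough sketch, reusing the names from your snippet (config["global_batch_size"] is a hypothetical key, and depending on your Ray version the world size may come from ray.air.session.get_world_size() instead):

import torch
import ray.train

def train_loop_per_worker(config: dict):
    # Derive the per-worker batch size from a fixed global batch size so that
    # changing num_workers does not change the effective batch size.
    world_size = ray.train.get_context().get_world_size()
    per_worker_batch_size = config["global_batch_size"] // world_size
    train_loader = torch.utils.data.DataLoader(
        data,
        batch_size=per_worker_batch_size,
    )
    ...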
Another possibility is that you are randomly shuffling the input dataset, which will also result in a different model every time you run it.
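If shuffling turns out to be the source, you can pin it down by seeding the DataLoader shuffle explicitly, along the lines of the PyTorch reproducibility notes. A sketch, again reusing data and config from your snippet:

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Re-seed NumPy and Python's random inside each DataLoader worker process.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

train_loader = torch.utils.data.DataLoader(
    data,
    batch_size=config['batch_size'],
    shuffle=True,
    generator=g,
    worker_init_fn=seed_worker,
)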
If you would like us to get to the bottom of this, please file a GitHub issue against us with a complete, reproducible script.
Hi,
Even with the same configuration I am getting different model outputs. I also enabled train.torch.enable_reproducibility with the same seed. I guess PyTorch trains the model in a non-deterministic way.
Did you follow this to set manual seeds for all of PyTorch, pandas, and NumPy?
https://pytorch.org/docs/stable/notes/randomness.html
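For reference, the minimal version of that looks roughly like this, called once at the top of the training function on every worker (pandas draws its randomness from NumPy, so seeding NumPy covers it):

import random
import numpy as np
import torch

def set_seeds(seed: int = 0):
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy (also used by pandas)
    torch.manual_seed(seed)  # seeds PyTorch on CPU and all GPUs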
We studied this for RLlib at some point. CPU-only training can be completely deterministic after you set all of the above.
However, GPU training is different. There are asynchronous operations built natively into the hardware, so even if you set all the seeds, you will still see randomness when you run training on a GPU device.
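For the GPU case, PyTorch does expose a few switches that trade performance for determinism; this is only a sketch, and some ops will simply raise an error because they have no deterministic implementation:

import torch

# Ask PyTorch / cuDNN for deterministic kernels (slower, and some ops will
# error out because they have no deterministic implementation).
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Some CUDA versions also require the CUBLAS_WORKSPACE_CONFIG environment
# variable to be set; see the PyTorch reproducibility notes linked above.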
Yes, I suppose train.torch.enable_reproducibility(seed) would take care of setting the manual seeds.
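For reference, I call it at the very top of my training function, roughly like this (the "seed" config key is just illustrative):

from ray import train

def train_loop_per_worker(self, config: dict):
    # Expected to seed Torch/NumPy/random and enable deterministic algorithms
    # on each worker, before any data loading or model creation happens.
    train.torch.enable_reproducibility(seed=config.get("seed", 0))
    ...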
Yep. If you are able to share the exact script and dataset, please file an issue on GitHub; I am curious to see where the randomness comes from.
Otherwise, I guess you will have to go over each component multiple times and inspect things like: does the data iterator on worker N always give you the same batches? With the same batch of data, do you always get the same loss? Are the networks initialized to the same weights across runs? Etc.
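A cheap way to check a couple of those is to log a fingerprint of the initial weights and of the first batch in each run and compare them across runs. Just a sketch, the helper names are made up:

import hashlib
import torch

def model_fingerprint(model: torch.nn.Module) -> str:
    # Hash all parameters in a fixed order so identical initializations
    # produce identical fingerprints across runs.
    h = hashlib.md5()
    for name, p in sorted(model.named_parameters(), key=lambda kv: kv[0]):
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()

def tensor_fingerprint(t: torch.Tensor) -> str:
    return hashlib.md5(t.detach().cpu().numpy().tobytes()).hexdigest()

# e.g. right after model creation / data loader preparation:
# print("initial weights:", model_fingerprint(model))
# print("first batch:", tensor_fingerprint(next(iter(train_data_loader))[0]))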
Yeah, I need to debug every step of the code to see where the randomness is coming from. I was under the impression that it might be due to the Ray implementation, but I guess the problem is more likely in the code I am writing. Sorry, I can't share the full code on the forum.