Hi Team,
I recently started using Ray to scale up and speed up the training of a deep learning model. For this I switched to TorchTrainer and made all the required code changes, and I was able to train the model much faster. But when I was playing around with num_workers in ScalingConfig, I got a different model (loss, predictions, scores, etc.) every time I ran with a different num_workers, even though none of the other params/configs were changed. The model is also different from the one produced by sequential training. Am I missing something here?
Please help.
What is the (global) batch size? Depending on how you write your code, it may also scale with num_workers in ScalingConfig, which can result in different models.
Hi,
I didn't define a global batch size. I don't understand why different models are generated just by changing num_workers. In fact, if I run the training many times with the same configuration I get a different model every time. I am running the code on a single machine with a 64-core CPU. Here is my trainer:
TorchTrainer(
    train_loop_per_worker=usad.train_loop_per_worker,
    train_loop_config=config_dict,
    scaling_config=ScalingConfig(num_workers=10, resources_per_worker={"CPU": 1, "GPU": 0}),
    run_config=RunConfig(local_dir=f"{ray_processed_folder}/{dataset}"),
)
Is there a reason why the trainer gives a different model every time? Can we fix that?
What is your train_loop_per_worker function?
Hi,
train_loop_per_worker is my training function. The code looks something like this:
import torch
import ray.train.torch
from torch.nn.parallel import DistributedDataParallel

# data, device, generate_model, to_device, train_batch and evaluate
# come from the surrounding class/module.
def train_loop_per_worker(self, config: dict):
    # Note: batch_size here is the per-worker batch size.
    train_loader = torch.utils.data.DataLoader(
        data,
        batch_size=config['batch_size']
    )
    model = generate_model()
    train_data_loader = ray.train.torch.prepare_data_loader(train_loader)
    # Unwrap the model if it was wrapped in DistributedDataParallel.
    if isinstance(model, DistributedDataParallel):
        model = model.module
    loss_history = []
    for epoch in range(config['num_epochs']):
        for index, [batch] in enumerate(train_data_loader):
            batch = to_device(batch, device)
            train_batch(batch)
        result = evaluate(train_data_loader)
        loss_history.append(result)
    model.save_model(config['path'])
Hi, any idea why this is happening?
I think what is happening here is that you are using a fixed per-worker batch size (config['batch_size']) in your training loop function.
Now, if you use a different number of workers, you are essentially training with a different global batch size: global_batch_size = config["batch_size"] * num_workers.
That will for sure impact model training.
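If you want the effective (global) batch size to stay the same no matter how many workers you use, one option is to put a global batch size in the config and divide it by the number of workers inside the training loop. A rough sketch, reusing the names from your snippet (config["global_batch_size"] is a hypothetical key, and depending on your Ray version the world size may come from ray.air.session.get_world_size() instead):

import torch
import ray.train

def train_loop_per_worker(config: dict):
    # Derive the per-worker batch size from a fixed global batch size so that
    # changing num_workers does not change the effective batch size.
    world_size = ray.train.get_context().get_world_size()
    per_worker_batch_size = config["global_batch_size"] // world_size
    train_loader = torch.utils.data.DataLoader(
        data,
        batch_size=per_worker_batch_size,
    )
    ...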
Another possibility is that you are randomly shuffling the input dataset, which will also result in a different model every time you run it.
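If shuffling turns out to be the source, you can pin it down by seeding the DataLoader shuffle explicitly, along the lines of the PyTorch reproducibility notes. A sketch, again reusing data and config from your snippet:

import random
import numpy as np
import torch

def seed_worker(worker_id):
    # Re-seed NumPy and Python's random inside each DataLoader worker process.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

train_loader = torch.utils.data.DataLoader(
    data,
    batch_size=config['batch_size'],
    shuffle=True,
    generator=g,
    worker_init_fn=seed_worker,
)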
If you would like us to get to the bottom of this, please file a GitHub issue against us with a complete, reproducible script.
Hi,
Even with the same configuration I am getting different model outputs. I also enabled train.torch.enable_reproducibility with the same seed. I guess PyTorch trains the model in a non-deterministic way.
Did you follow this to set manual seeds for all of PyTorch, pandas, and NumPy?
https://pytorch.org/docs/stable/notes/randomness.html
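For reference, the minimal version of that looks roughly like this, called once at the top of the training function on every worker (pandas draws its randomness from NumPy, so seeding NumPy covers it):

import random
import numpy as np
import torch

def set_seeds(seed: int = 0):
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy (also used by pandas)
    torch.manual_seed(seed)  # seeds PyTorch on CPU and all GPUs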
We studied this for RLlib at some point. CPU-only training can be completely deterministic after you set all of the above.
However, GPU training is different. There are asynchronous operations built natively into the hardware, so even if you set all the seeds, you will still see randomness when you run training on a GPU device.
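For the GPU case, PyTorch does expose a few switches that trade performance for determinism; this is only a sketch, and some ops will simply raise an error because they have no deterministic implementation:

import torch

# Ask PyTorch / cuDNN for deterministic kernels (slower, and some ops will
# error out because they have no deterministic implementation).
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Some CUDA versions also require the CUBLAS_WORKSPACE_CONFIG environment
# variable to be set; see the PyTorch reproducibility notes linked above.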
Yes, I suppose train.torch.enable_reproducibility(seed) would take care of setting the manual seeds.
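For reference, I call it at the very top of my training function, roughly like this (the "seed" config key is just illustrative):

from ray import train

def train_loop_per_worker(self, config: dict):
    # Expected to seed Torch/NumPy/random and enable deterministic algorithms
    # on each worker, before any data loading or model creation happens.
    train.torch.enable_reproducibility(seed=config.get("seed", 0))
    ...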
Yep. If you are able to share the exact script and dataset, please file an issue on GitHub; I am curious to see where the randomness comes from.
Otherwise, I guess you will have to go over each component multiple times and inspect things like: does the data iterator on worker N always give you the same batches? With the same batch of data, do you always get the same loss? Are the networks initialized to the same weights across runs? Etc.
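A cheap way to check a couple of those is to log a fingerprint of the initial weights and of the first batch in each run and compare them across runs. Just a sketch, the helper names are made up:

import hashlib
import torch

def model_fingerprint(model: torch.nn.Module) -> str:
    # Hash all parameters in a fixed order so identical initializations
    # produce identical fingerprints across runs.
    h = hashlib.md5()
    for name, p in sorted(model.named_parameters(), key=lambda kv: kv[0]):
        h.update(p.detach().cpu().numpy().tobytes())
    return h.hexdigest()

def tensor_fingerprint(t: torch.Tensor) -> str:
    return hashlib.md5(t.detach().cpu().numpy().tobytes()).hexdigest()

# e.g. right after model creation / data loader preparation:
# print("initial weights:", model_fingerprint(model))
# print("first batch:", tensor_fingerprint(next(iter(train_data_loader))[0]))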
Yeah, I need to debug every step of the code to see where the randomness is coming from. I was under the impression that it might be due to the Ray implementation, but I guess the problem is more likely in the code I am writing. Sorry, I can't share the full code on the forum.