import torch.distributed as dist

for i in pbar:
    info = {"num_steps": 1} if args.smoke_test else {}
    info["epoch_idx"] = i
    info["num_epochs"] = args.num_epochs
    # Increase `max_retries` to turn on fault tolerance.
    trainer1.train(max_retries=1, info=info)
    dist.barrier()
    # Do any per-epoch work here, then validate.
    val_stats = trainer1.validate()
Ah, OK, the process group is not initialized on the driver process, only on the workers.
If you need the process group initialized on the driver, you can pass use_local=True to your TorchTrainer; this will make the rank 0 worker run on the driver process.
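For reference, a minimal sketch of what that instantiation might look like, assuming the Ray SGD TorchTrainer API. MyTrainingOperator and the config values are placeholders, so adapt the names and arguments to your own setup and Ray version:

import ray
from ray.util.sgd.torch import TorchTrainer

ray.init()

# Sketch only: MyTrainingOperator stands in for your own TrainingOperator subclass.
trainer1 = TorchTrainer(
    training_operator_cls=MyTrainingOperator,
    num_workers=2,    # a process group is only created with more than 1 worker
    use_local=True,   # run the rank 0 worker on the driver process
    config={"batch_size": 64},
)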
Even after turning on use_local, I still get the same error:
    dist.barrier()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1423, in barrier
    _check_default_pg()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 192, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
Hmm, interesting. Where in the code are you calling dist.barrier()? This is after TorchTrainer instantiation, correct? Also, are you using more than 1 worker?
Ok, yeah, no process group is created if you only use 1 worker. If you use more than one worker and also have use_local set, then a process group will be initialized on the driver.
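If you want the driver-side loop to keep working even when no process group exists (for example, a single-worker smoke test), one option is to guard the barrier with torch.distributed's own checks. This is just a suggested workaround, not something TorchTrainer requires:

import torch.distributed as dist

# Only synchronize when a default process group actually exists on this process.
if dist.is_available() and dist.is_initialized():
    dist.barrier()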