Is there any way to override the port check / port requirement with AccelerateTrainer?
I have a trainer set up as follows, but I get repeated errors of:
File "/azureml-envs/azureml_09fb2cbb33463830767bf82de238acd0/lib/python3.9/site-packages/accelerate/utils/other.py", line 224, in is_port_in_use
return s.connect_ex(("localhost", port)) == 0
TypeError: an integer is required (got type str)
The accelerate code that complains is:
def is_port_in_use(port: int = None) -> bool:
    """
    Checks if a port is in use on `localhost`. Useful for checking if multiple `accelerate launch` commands have been
    run and need to see if the port is already in use.
    """
    if port is None:
        port = 29500
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("localhost", port)) == 0
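So the socket call itself is just rejecting a string port; whatever ends up being passed down to this check arrives as a str instead of an int. A minimal sketch that reproduces the same TypeError on Python 3.9 (the "29500" string here is only an example value):

import socket

# Passing the port as a str reproduces the TypeError from the traceback above.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect_ex(("localhost", "29500"))  # TypeError: an integer is required (got type str)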
My accelerate config looks like:
accelerate_config = {
    'compute_environment': 'LOCAL_MACHINE',
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',
    'mixed_precision': 'bf16',
    'machine_rank': 0,
    'debug': False,
    'same_network': False,
    'main_training_function': 'main',
    'num_machines': 2,
    'num_processes': 1,
    'gpu_ids': 'all',
}
If I add a `port` key to the `accelerate_config`, it throws an exception.
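For reference, the key a standard `accelerate config` YAML uses for this value is `main_process_port` (an int), not `port`. I don't know whether AccelerateTrainer forwards it, but this is the shape I would have expected to work (untested sketch):

accelerate_config = {
    'compute_environment': 'LOCAL_MACHINE',
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',
    'mixed_precision': 'bf16',
    'machine_rank': 0,
    'main_process_port': 29500,  # assumption: int port under the YAML-style key, not 'port'
    'same_network': False,
    'main_training_function': 'main',
    'num_machines': 2,
    'num_processes': 1,
    'gpu_ids': 'all',
}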
Edit. A more minimal accelerate dict works:
accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'num_machines': 2,
    'num_processes': 1,
}
But each of these fails with the port complaint:
accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}

accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'FSDP',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}

accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'DEEPSPEED',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}
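Failing a proper config key, would a monkey-patch be a reasonable stopgap? Something along these lines (hypothetical and untested; it only takes effect if it runs in the worker process before the caller looks the function up, so it may well not apply in this setup):

import accelerate.utils.other as accel_other

_original_is_port_in_use = accel_other.is_port_in_use

def _coercing_is_port_in_use(port=None):
    # Coerce a str port (e.g. one read from an environment variable) to int
    # before delegating to accelerate's original check.
    return _original_is_port_in_use(int(port) if port is not None else None)

accel_other.is_port_in_use = _coercing_is_port_in_use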