Is there any way to override the port check / port requirement with AccelerateTrainer?
I have a trainer set up as follows, but I get repeated errors of:
File "/azureml-envs/azureml_09fb2cbb33463830767bf82de238acd0/lib/python3.9/site-packages/accelerate/utils/other.py", line 224, in is_port_in_use
return s.connect_ex(("localhost", port)) == 0
TypeError: an integer is required (got type str)
The accelerate code that complains is:
def is_port_in_use(port: int = None) -> bool:
    """
    Checks if a port is in use on `localhost`. Useful for checking if multiple `accelerate launch` commands have been
    run and need to see if the port is already in use.
    """
    if port is None:
        port = 29500
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex(("localhost", port)) == 0
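So the socket call itself is just rejecting a string port; whatever ends up being passed down to this check arrives as a str instead of an int. A minimal sketch that reproduces the same TypeError on Python 3.9 (the "29500" string here is only an example value):

import socket

# Passing the port as a str reproduces the TypeError from the traceback above.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.connect_ex(("localhost", "29500"))  # TypeError: an integer is required (got type str)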
My accelerate config looks like:
accelerate_config = {
    'compute_environment': 'LOCAL_MACHINE',
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',
    'mixed_precision': 'bf16',
    'machine_rank': 0,
    'debug': False,
    'same_network': False,
    'main_training_function': 'main',
    'num_machines': 2,
    'num_processes': 1,
    'gpu_ids': 'all',
}
If I add a `port` key to the `accelerate_config`, it throws an exception.
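For reference, the key a standard `accelerate config` YAML uses for this value is `main_process_port` (an int), not `port`. I don't know whether AccelerateTrainer forwards it, but this is the shape I would have expected to work (untested sketch):

accelerate_config = {
    'compute_environment': 'LOCAL_MACHINE',
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',
    'mixed_precision': 'bf16',
    'machine_rank': 0,
    'main_process_port': 29500,  # assumption: int port under the YAML-style key, not 'port'
    'same_network': False,
    'main_training_function': 'main',
    'num_machines': 2,
    'num_processes': 1,
    'gpu_ids': 'all',
}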
Edit. A more minimal accelerate dict works:
accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'num_machines': 2,
    'num_processes': 1,
}
But each of these fails with the port complaint:
accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'MULTI_GPU',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}

accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'FSDP',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}

accelerate_config = {
    'debug': False,
    'deepspeed_config': {},
    'distributed_type': 'DEEPSPEED',  # crashes
    'fsdp_config': {},
    'num_machines': 2,
    'num_processes': 1,
}
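Failing a proper config key, would a monkey-patch be a reasonable stopgap? Something along these lines (hypothetical and untested; it only takes effect if it runs in the worker process before the caller looks the function up, so it may well not apply in this setup):

import accelerate.utils.other as accel_other

_original_is_port_in_use = accel_other.is_port_in_use

def _coercing_is_port_in_use(port=None):
    # Coerce a str port (e.g. one read from an environment variable) to int
    # before delegating to accelerate's original check.
    return _original_is_port_in_use(int(port) if port is not None else None)

accel_other.is_port_in_use = _coercing_is_port_in_use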