TensorFlow and PyTorch cannot run distributed training

I have created a cluster with one head node and 2 worker nodes.

I followed the tensorflow_mnist_example training example for distributed training, but what confuses me is that only the head node runs the training process.
I printed the TF_CONFIG environment variable, and only the head node's IP address appears:

{
    'cluster': {'worker': ['<head node ip>:37909', '<head node ip>:48507']}, 
    'task': {'type': 'worker', 'index': 1}
}

Is there something wrong with my environment?

Hi @yydai, what do you mean by “The head node can run the training process.”?

My TensorFlow training task only executes on the head node; no training tasks are scheduled on the worker nodes.

cluster info

head node:    10.68.134.13
worker1 node: 10.88.112.16
worker2 node: 10.88.112.17

The log information is as follows

Connecting to existing Ray cluster at address: 10.68.134.13:6379...
Connected to Ray cluster. View the dashboard at 10.68.134.13:8265

2024-01-19 02:46:43.053895: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.68.134.13:53431, 1 -> 10.68.134.13:56466, 2 -> 10.68.134.13:49974, 3 -> 10.68.134.13:58255} [repeated 3x across cluster]

I also printed the TF_CONFIG environment variable (I did not set this myself):

{'cluster': {'worker': ['10.68.134.13:53431', '10.68.134.13:56466', '10.68.134.13:49974', '10.68.134.13:58255']}, 'task': {'type': 'worker', 'index': 1}}

Only the head node IP (10.68.134.13) appears in this environment variable.
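A quick way to confirm this observation (a self-contained sketch using only the standard library; the value is copied from the TF_CONFIG printed above, whereas a real worker would read `os.environ["TF_CONFIG"]` instead) is to parse the config and count the distinct hosts:

```python
import json

# TF_CONFIG value observed above; hard-coded here only for illustration
raw = ('{"cluster": {"worker": ["10.68.134.13:53431", "10.68.134.13:56466", '
       '"10.68.134.13:49974", "10.68.134.13:58255"]}, '
       '"task": {"type": "worker", "index": 1}}')
tf_config = json.loads(raw)

# strip the ports and deduplicate: one entry per physical node
hosts = {addr.split(":")[0] for addr in tf_config["cluster"]["worker"]}
print(hosts)  # {'10.68.134.13'} -> every worker landed on the head node
```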

My questions are:

  1. Should I set TF_CONFIG manually, and if so, how?
  2. Where does TF_CONFIG come from?
  3. How can I debug this problem?

Has anyone encountered this problem?

More information:

  1. My environment is a Docker container
  2. Ray version: 2.9.0
  3. TensorFlow version: 2.8.1

Hi @yydai, it looks like multiple workers are being scheduled, but the problem is that all of them are on the head node.

What does your cluster setup look like? (number of nodes, number of GPUs on each node)

What does your ScalingConfig look like?

  • If you have multiple GPUs on the head node, then Ray might schedule all of the workers on the head node (assigning 1 GPU per worker unless you specify more).
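One way to nudge Ray away from packing everything onto the head node is to reserve more resources per worker, so that not all workers fit on one node. A hedged config sketch (`resources_per_worker` is a real `ScalingConfig` field; the CPU counts here are assumed example values, not taken from your cluster):

```python
from ray.train import ScalingConfig

# assumption for illustration: each node has 8 CPUs, so asking for
# 4 CPUs per worker means at most 2 workers fit per node
scaling_config = ScalingConfig(
    num_workers=4,
    use_gpu=False,
    resources_per_worker={"CPU": 4},
)
```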

Hi @justinvyu , thank you for your reply.

In fact, everything was indeed scheduled onto one node when num_workers was small (the Ray demo defaults to 2). When I increased the number of workers to 40, tasks started to be scheduled onto the other worker nodes. My ScalingConfig is

ScalingConfig(num_workers=num_workers, use_gpu=False),

For this reason, I took a look at the Ray code and found that we can observe how workers are scheduled and grouped across machines with the following code:

import ray
from ray.train._internal.worker_group import WorkerGroup

ray.init("auto")
wg = WorkerGroup(num_workers=50)
print([w.metadata.hostname for w in wg.workers])

I also found the place where TF_CONFIG (my other question) is set:

def _setup_tensorflow_environment(worker_addresses: List[str], index: int):
    """Set up distributed Tensorflow training information.
 
    This function should be called on each worker.
 
    Args:
        worker_addresses: Addresses of all the workers.
        index: Index (i.e. world rank) of the current worker.
    """
    tf_config = {
        "cluster": {"worker": worker_addresses},
        "task": {"type": "worker", "index": index},
    }
    os.environ["TF_CONFIG"] = json.dumps(tf_config)
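To see what this produces, here is a minimal, self-contained sketch that reproduces the function above with its imports and calls it with two hypothetical worker addresses (one per worker node, which is what we would expect when workers are actually spread):

```python
import json
import os
from typing import List

def _setup_tensorflow_environment(worker_addresses: List[str], index: int):
    """Set TF_CONFIG the same way the Ray code quoted above does."""
    tf_config = {
        "cluster": {"worker": worker_addresses},
        "task": {"type": "worker", "index": index},
    }
    os.environ["TF_CONFIG"] = json.dumps(tf_config)

# hypothetical addresses for illustration: one worker per worker node
_setup_tensorflow_environment(
    ["10.88.112.16:1000", "10.88.112.17:1001"], index=0
)
print(os.environ["TF_CONFIG"])
```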

Finally, I changed the scaling_config to

ScalingConfig(num_workers=num_workers, use_gpu=False, placement_strategy='STRICT_SPREAD')

With the above configuration, the program seems to get stuck and runs very slowly. Is this normal? When I changed the strategy to SPREAD, execution completed quickly.
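A likely explanation, based on Ray's placement-group semantics (STRICT_SPREAD requires each bundle on a distinct node, while SPREAD is only best-effort): with 3 nodes, a 40-worker STRICT_SPREAD request can never be satisfied, so the placement group stays pending and the job appears stuck. A toy check of that constraint:

```python
# toy illustration of the STRICT_SPREAD constraint (assumption: every
# worker bundle must land on a distinct node)
num_nodes = 3        # head + 2 worker nodes, from the cluster info above
num_workers = 40

strict_spread_ok = num_workers <= num_nodes
print(strict_spread_ok)  # False -> placement group pends, job looks stuck
```

With SPREAD, Ray spreads workers across nodes on a best-effort basis and falls back to packing when it must, which is why that run completed quickly.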