## I Introduction

Wireless sensor networks are now used in a wide range of applications in medicine, telecommunications, and environmental domains, see [2] for a survey. For instance, they are employed for human health monitoring [3], activity recognition on home environments [4], spectrum sensing in cognitive radio [5], and so forth. In most applications, the network is asked to perform a given estimation, detection or learning task over measurements collected by sensors. In this paper, we consider clustering as a particular learning task. The purpose of clustering is to divide data into clusters such that the data inside a given cluster are similar with each other and different from the data belonging to the other clusters [6].

In this paper, we would like to take the following two major practical constraints into account. First, sensor networks usually require a fusion center, whose twofold role is to receive the data and achieve the desired learning task. In this case, we say that the learning task is centralized. However, it is not always practical to set up a fusion center, especially in recent applications involving autonomous drones or robots [7]. In such applications, the sensors should perform the learning task by themselves in a fully decentralized setup, without resorting to any fusion center. In this case, we say that the learning task performed by the network is decentralized.

Second, it is crucial to reduce the sensors energy consumption in order to increase the network lifetime. Since most of the energy of the sensors is consumed by transmitting data via the communication system, it is highly desirable to transmit these data in a compressed form so as to lower the energy consumption. In addition, because the objective of a clustering task is not to reconstruct all the sensor measurements but only to cluster them, performing this task on compressed data is all the more desirable as it avoids costly decoding operations.

According to the foregoing, our focus is thus on the design of clustering algorithms that work directly over compressed measurements in a fully decentralized setup. In such a decentralized setup, each sensor must perform the clustering with only partial observations of the available compressed measurements, while minimizing the amount of data exchanged in the network.

Clustering over compressed measurements was recently addressed in [8, 9, 10], for the K-means algorithm only. The K-means algorithm is very popular due to its simplicity and effectiveness [6]. It makes no assumption on the signal model of the measurement vectors that belong to a cluster and, as such, it is especially relevant for applications such as document classification [11], information retrieval, or categorical data clustering [12]. However, the K-means algorithm requires prior knowledge of the number of clusters, which is not acceptable in a network of sensors where the data is non-stationary and may vary from one data collection to another. When is unknown, one could think of applying a penalized method [13] that permits to jointly estimate and perform the clustering. Unfortunately, this method requires running the K-means algorithm several times with different numbers of clusters, which may be quite energy consuming. As another issue, the K-means algorithm must be initialized properly in order to get a chance to correctly retrieve the clusters. Proper initialization can be obtained with the K-means++ procedure [14], which requires computing all the two-by-two distances between all the measurement vectors of the dataset. As a result, the K-means++ procedure is not affordable in a decentralized setup. It is worth mentioning that the variants of K-means such as Fuzzy K-means [15] suffer from the same two issues.

Other clustering algorithms such as DB-SCAN [16] and OPTICS [17] may appear as suitable candidates for decentralized clustering since they do not need the number of clusters. However, they require setting two parameters that are the maximum distance between two points in a cluster and the minimum number of points per cluster. These parameters have a strong influence on the clustering performance, but they can hardly be estimated and they must be chosen empirically [18].

Therefore, our purpose is to derive a solution that bypasses the aforementioned issues for clustering compressed data in a decentralized setup. In this respect, we proceed in two main steps. We begin by introducing a centralized clustering algorithm that circumvents the drawbacks of the standard algorithms. This algorithm is hereafter named CENTREx as it performs the clustering without prior knowledge of the number of clusters. In a second step, we devise a decentralized version DeCENTREx of this algorithm.

Crucially, CENTREx derives from a model-based theoretical approach. We hereafter consider the same Gaussian model as in [19, 20]. In this model recalled in Section II, the measurement vectors belonging to a given cluster are supposed to be the cluster centroid corrupted by additive Gaussian noise. In our model, the Gaussian noise is not necessarily independent and identically distributed (i.i.d.) as it is described by a non-diagonal covariance matrix. Here, in contrast to [19, 20], we will not assume a known number of clusters, but suppose that the covariance matrix is known. This assumption was already made for clustering in [15, 21] in order to choose the parameters for the functions that compute the cluster centroids. Further, in a sensor network context, this assumption is more acceptable than a prior known number of clusters. Indeed, in many signal processing applications, the noise covariance matrix can be estimated, either on-the-fly or from preliminary measurements, via many parametric, non-parametric, and robust methods (see [22, 23, 24], among others).

On the basis of this model, a new cost function for clustering over compressed data is introduced in Section III. This cost function generalizes the function introduced in [15] for clustering over non-compressed data. In [15], the choice of the cost function was justified by an analogy with M-estimation [25], but it was not supported by any theoretical arguments related to clustering. On the opposite, the novel theoretical analysis we conduct in Section III shows that, under asymptotic conditions, the compressed cluster centroids are the only minimizers of the introduced cost function. The cost function depends on a weight function that must verify some properties deriving from the theoretical analysis. As exposed in Section IV, the weight function is chosen as the p-value of a Wald hypothesis test [26]. This p-value measures the plausibility that a measurement vector belongs to a given cluster. In addition, its expression does not depend on any empirical parameter that could influence the final clustering performance.

In Sections VI and VII, we describe the clustering algorithms CENTREx and DeCENTREx that derive from our mathematical analysis.
Given the compressed measurements, both algorithms estimate the compressed cluster centroids one after each other by computing the minimizers of our cost function, even when the number of minimizers is *a priori* unknown.
The clustering is then performed by assigning each measurement to the cluster with the closest estimated centroid.
The decentralized version DeCENTREx takes advantage of the fact that our approach does not require prior knowledge of the number of clusters, and that it does not suffer from initialization issues.
We show that, due to these advantages, the amount of data to be exchanged between sensors for DeCENTREx is much lower than for decentralized K-means [27].
Simulation results presented in Section VIII show that our algorithms give much better performance than DB-Scan and that they only suffer a small loss in performance compared to K-means with known . We also observe that our algorithms give the same level of clustering performance as K-means with *a priori* unknown, while requiring less data exchange.

## Ii Signal model and notation

In this section, we introduce our notation and assumptions for the signal model and the data collected by sensors in the network. We also recall the definition of the Mahalanobis norm which will be useful in the theoretical analysis proposed in the paper.

### Ii-a Signal Model

In this paper, the notation denotes the set of integers between and . Consider a set of independent and identically distributed (i.i.d.) -dimensional random Gaussian vectors with same covariance matrix . We consider that the measurement vectors are split into clusters defined by deterministic centroids , with for each . Accordingly, we assume that for each , there exists such that and we say that belongs to cluster . In the following, we assume that the covariance matrix is known prior to clustering.

The measurement vectors are all multiplied by a sensing matrix , which produces compressed vectors , . As a result, , where represents the compressed centroids. Here, the matrix is known and it is the same for all the sensors. It is assumed to have full rank so that is injective. This matrix performs compression whenever . The theoretical analysis presented in the paper applies whatever the considered full rank matrix, and in our simulations, we will consider several different choices for .

In the paper, the data repartition in the network will depend on the considered setup. In the centralized setup, we will assume that all the compressed vectors are available at a fusion center. In the decentralized setup, we will assume that the network is composed by sensors which all observe a different subset of the measurement vectors.

In the following, we start by describing the centralized version of the algorithm. We assume that the centroids , and their compressed versions , are unknown. We want to propose an algorithm that groups the compressed measurement vectors into clusters, without prior knowledge of the number of clusters. The first step of our algorithm consists of estimating the compressed centroids . Our centroid estimation method relies on the Mahalanobis norm whose properties are recalled now.

### Ii-B Mahalanobis norm

Consider an positive-definite matrix . The Mahalanobis norm is defined for any by setting . If

is the identity matrix

, the Mahalanobis norm is the standard Euclidean norm in . More generally, since is positive definite, it can be decomposed as where is a diagonal matrix whose diagonal values arethe eigenvalues of

andcontains the corresponding eigenvectors. By setting

it is easy to verify that:(1) |

According to (1), is called the whitening matrix of .

## Iii Centroid estimation

In this section, we introduce a new cost function for the estimation of the compressed centroids from the measurement vectors . We then present our theoretical analysis that shows that the compressed centroids are the only minimizers of the cost function.

### Iii-a Cost Function for centroid estimation

Consider an increasing, convex and differentiable function that verifies . First assume that the number of clusters is known, and consider the following cost function for the estimation of the compressed centroids:

(2) |

with . This cost function generalizes the one introduced in [15] for centroid estimation when is known. In [15], the clustering was performed over i.i.d. Gaussian vectors, and the particular case was considered. In contrast, our analysis assumes a general positive-definite matrix , which will permit to take into account both a non-diagonal covariance matrix and the correlation introduced by the compression matrix . In addition, [15] only considers the particular case , where is a parameter that has to be chosen empirically. On the opposite, here, we consider a class of possible functions , and the properties that these functions should verify will be exposed in the subsequent theoretical analysis. Note that the approach in [15] was inspired by the M-estimation theory [25].

In order to estimate the centroids, we want to minimize the cost function (2) with respect to . Since is convex by the properties of and , is differentiable, and is invertible, standard matrix differentiation [28, Sec. 2.4] allow to show that the minimizer of (2) should verify

(3) |

where is hereafter called the -dimensional weight function and is called the scalar weight function. Unless necessary, we generally drop the adjective ’scalar’ in the sequel.

Solving (3) amounts to looking for the fixed-points of the function defined as

(4) |

In [15], no theoretical argument was given to demonstrate that the introduced cost function was appropriate for the estimation of the cluster centroids. On the opposite, in the following, we show the following strong result: the centroids are the only fixed points of under asymptotic conditions, provided that the weight function verifies certain properties.

Perhaps surprisingly, the expression of depends neither on the considered cluster , nor on the number of clusters . The foregoing suggests that, even when is unknown, estimating the centroids can be performed by seeking the fixed points of . This claim is theoretically and experimentally verified below for a certain class of matrices .

### Iii-B Fixed-point analysis

The following proposition shows that the compressed centroids are the only fixed points of the function defined in (4).

###### Proposition 1.

With the same notation as above, let be the number of data belonging to cluster and set . Assume that there exist such that . Assume also that the function is non-null, non-negative, continuous, bounded and verifies:

(5) |

For any positive definite matrix proportional to , for any , and any :

###### Proof:

For any , let be the compressed vectors that belong to cluster . Consider a given matrix that is positive definite and proportional to . We can write in the form:

(6) |

The random function (6) can then be rewritten as with:

(7) |

Therefore, , with .

For any , we set , and . With this notation, we have as well as:

(8) |

and:

(9) |

By the strong law of large numbers, it follows from (

8) and (9) that for any ,and . Assume now that . It follows from (5) and Lemma 1 of Appendix A that:

Because , the left hand side (lhs) to the equality above is if and only if

Proposition 1 shows that the centroids are the unique fixed points of the function , when the sample size and the distances between centroids tend to infinity. This result means that, at least asymptotically, no vector other than a centroid can be a fixed point of . This result, as well as the fact that the expression of depends on neither nor , will allow us to derive a clustering algorithm that does not require prior knowledge of .

We however wonder about the statistical behavior of the fixed points of in non-asymptotic situations. In particular, the non-asymptotic fixed-points statistical model derived in the next section will help us refine our clustering algorithm. Although derived from some approximations, this model will allow us to choose weight functions that verify the conditions of Proposition 1 and that are also suitable when the sample size and the distances between centroids are finite.

### Iii-C Fixed point statistical model

Under the assumptions of Proposition 1, a fixed point of provides us with an estimated centroid for some unknown centroid . The following claim gives the statistical model we consider for the estimated centroids . This result is given by a claim rather than a proposition, since its derivation is based on several approximations.

###### Claim 1.

For any positive definite matrix with and all , we approximate the statistical model of as

(11) |

where is the number of compressed vectors in cluster and

(12) |

with .

#### Derivation

In order to model the estimation error, we can start by writing . Of course, will be all the more small than approximates accurately . We can then write that , where

(13) |

The term is likely to be small if , , , that is if the function reduces strongly the influence of data from clusters other than . To finish, in absence of noise, we would directly have , but the presence of noise induces that . Finally, we have .

We now derive a model for , and we keep the same notation as in the proof of Proposition 1. In particular, with for any and any . It follows from (13) that with and

. The random variables

are iid and we proceed by computing their mean and covariance matrix.Given any , for any . As above, let be the whitening matrix of . According to (1), where and . It then follows from (10) that . We derive from the foregoing and Lemma 2 of Appendix B that and thus, that .

With the same notation as above, the covariance matrix of any is that of . Since this random vector is centered, its covariance matrix equals

Lemma 4 of Appendix D implies that

By the central limit theorem,

thus converges in distribution to . By the weak law of large numbers and (1) again,converges in probability to

. Slutsky’s theorem [29, Sec. 1.5.4, p. 19] implies that converges in distribution to , where is defined in (12).Therefore, is asymptotically Gaussian so that . We do not know how to model and yet. We merely know that the contributions of these two types of noise are small under the asymptotic conditions of Proposition 1. As a result, we do not take the influence of and into account and model the statistical behavior of by (11). ∎

In the above model, can be calculated by Monte-Carlo simulations, and we will explain in the algorithm description how we estimate . Although (11) may be a coarse approximation, since and are not necessarily negligible compared to , the experimental results reported in Section VIII support the practical relevance of the approach.

At the end, all the results of this section were derived from a generic scalar weight function , and the theoretical analysis provided the properties that should satisfy. In the following, we choose a weight function that satisfies these properties and that is suitable for clustering.

## Iv Weight function

The scalar weight function proposed in [15] verifies the properties required in Proposition 1. However, in this weight function, the parameter must be chosen empirically and its optimal value varies with and the noise parameters. A poor choice of can dramatically impact the performance of the clustering algorithm proposed in [15].

In contrast, we propose new weight functions whose expressions are known whatever the dimension and noise parameters. These new weight functions are devised as the p-values of Wald’s hypothesis tests for testing the mean of a Gaussian [26]. In this section, we thus begin by recalling the basics about Wald’s test for testing the mean of a Gaussian and we introduce the p-value for this test. We then derive the weight functions that will be used in our clustering algorithm.

### Iv-a p-value of Wald’s test for testing the mean of a Gaussian

Let , where the covariance matrix is positive definite and is unknown. Consider the problem of testing whether is centered or not. This problem can be summarized as:

(14) |

Recall that a non-randomized test is any measurable map from to . Given a realization of , the value returned by is the index of the hypothesis considered to be true. We say that accepts (resp. ) at if (resp. ). Given , let be the unique real value such that:

(15) |

where is the Generalized Marcum Function [30]. According to [26, Definition III & Proposition III, p. 450], the non-randomized test defined for any as

(16) |

guarantees a false alarm probability for the problem described by (14). Although there is no Uniformly Most Powerful (UMP) test for the composite binary hypothesis testing problem (14) [31, Sec. 3.7], turns out to be optimal with respect to several optimality criteria and within several classes of tests with level [32, Proposition 2]. In particular, is UMP with size among all spherically invariant tests and has Uniformly Best Constant Power (UBCP) on the spheres centered at the origin of [26, Definition III & Proposition III, p. 450]. It is hereafter called a Wald test, without recalling explicitly the level at which the testing is performed.

In Appendix C, we show that test has a p-value function defined for each by:

(17) |

The p-value

can be seen as a measure of the plausibility of the null hypothesis

given a realization of [31, Sec. 3.3].### Iv-B Weight function for clustering

We now define the weight function that will be used in our clustering algorithm. The expression of this weight function depends on the pvalue function defined in (17).

Henceforth, let be the function defined for any by Because this function is continuous, bounded by and satisfies [30], it satisfies the properties required in Proposition 1. We therefore choose it as our scalar weight function . Its corresponding -dimensional weight function is therefore defined for any by:

(18) |

Proposition 1 and Claim 1 hold for any matrix proportional to . From (18), we further observe that the -dimensional weight function depends on the choice of the matrix , which itself depends on the considered Wald test (14). This is why we now introduce specific Wald tests that will be considered for clustering. These tests will allow us to specify the matrices that will be used in our algorithms.

## V Hypothesis tests for clustering

In this section, we introduce all the hypothesis tests that will be used in our clustering algorithm. The first three tests directly derive from the Wald test introduced in Section IV-A and they will be used mainly for the derivation of the weight functions that are used in our algorithm. The fourth considered test will serve to decide whether two estimated centroids and actually correspond to the same centroid . Since it is not a Wald test, we completely define it in this section.

### V-a Wald’s tests for clustering

#### Test n°1

First consider two compressed vectors and , where and designate the centroids of the clusters to which and belong, respectively. In order to decide whether these two vectors belong to the same cluster, we can test the mean of the vector . This problem can be solved by the Wald hypothesis test described in Section IV-A, with .

#### Test n°2

Now assume that we want to decide whether the compressed vector belongs to cluster described by centroid . This problem can be addressed by testing the mean of the vector , which can be solved by the Wald’s test of Section IV-A with .

#### Test n°3

Our clustering algorithm will have to test whether belongs to cluster , only knowing an estimate of . According to Claim 1, given a positive definite matrix with , the estimated centroid is modeled as . We must thus choose a value of to specify the matrix used in the -dimensional weight function . In our clustering algorithm described in Section VI, we will actually consider two -dimensional weight functions specified by two different values of . The first weight function will be given by (18) with (Test n°1), and the second one will be given by (18) with (Test n°2). As a result, we hereafter consider .

Further, in order to decide whether belongs to cluster , we will assume that and are independent. In practice, will be calculated by using a large number of data so that the influence of one can be neglected. Consequently, in order to make the decision, we will test the mean of the vector . This problem can be solved by the Wald hypothesis test described in Section IV-A, with . Note that if is big, we can approximate , and Test n°3 degenerates into Test n°2.

### V-B Test n°4: hypothesis test for centroid fusion

Consider two fixed points and of , where is chosen according to Test n°1 or Test n°2. These fixed points are estimates of two centroids and . The centroid estimation method used in our algorithm will sometimes result in estimating several times the same centroid. This is why our algorithm will also contain a fusion step that will have to decide whether and are estimates of the same centroid. The fusion step will thus have to decide whether and are different or not, in which latter case and should be merged. Merging estimates of two different centroids may result in an artifact significantly far from the two true centroids. On the other hand, failing to merge estimates of the same centroid will only result in overestimating the number of centroids. This is why we would like to devise an hypothesis test from which the null hypothesis is rather than as in the Wald test. With this choice for the null hypothesis, the alternative hypothesis is .

In order to test against , we proceed similarly as above by considering . In contrast to the three tests discussed in the previous subsections, the random vector is not necessarily Gaussian. Indeed, and are not independent under either or . Therefore, the Wald test [26, Definition III & Proposition III, p. 450] does not apply. However, under , by considering that the scalar weight function tends to put significantly smaller weights on data from clusters other than (resp. ) to calculate (resp. ), we assume the independence of and . By further taking the statistical model of Claim 1 into account, we thus write that, under , with

. We can then proceed as usual in statistical hypothesis testing by exhibiting a test maintaining the false alarm probability of incorrectly rejecting

below a given significance level . Specifically, the test defined for every by:(19) |

where is determined according to (15), guarantees a false alarm probability less than or equal to for testing against . Indeed, this false alarm probability is:

where is the whitening matrix of . Since the generalized Marcum function increases with its first argument [30], .

## Vi Centralized Clustering Algorithm

This section describes our centralized clustering algorithm CENTREx that applies to compressed data. This algorithm derives from the theoretical analysis introduced in the paper. In this section, we first present the three main steps of this algorithm (centroid estimation, fusion, and classification). We then describe each of these steps into details. We also explain how to choose two empirical parameters that are the false alarm probability and the stopping condition , and we discuss their influence on the clustering performance.

### Vi-a Algorithm description

The objective of our clustering algorithm is to divide the set of received compressed vectors into clusters, where is unknown *a priori*.
The algorithm can be decomposed into three main steps.
The first step consists of estimating compressed centroids from .
The centroids are estimated one after each other by seeking the fixed points of defined in (4) (see Section VI-B).
Unfortunately, due to initialization issues, this process
may estimate several times the same centroids. This is why the algorithm then applies a fusion step.
At this step, the algorithm looks for the estimated that correspond to the same centroid by applying Test n°4
to every pair (see Section VI-C). This yields a reduced set of estimated centroids.
To finish, the algorithm performs a classification step associating each compressed vector to the cluster with the closest centroid (see Section VI-D).
We now describe into details each of these steps.

### Vi-B Centroid estimation

In this section, we introduce the method we use in order to estimate the centroids one after each other. Initialize by the set of centroids estimated by the algorithm. Also, initialize by the set of vectors that are considered as marked, where a marked vector cannot be used anymore to initialize the estimation of a new centroid.

The centroids are estimated one after the other, until . When the algorithm has already estimated centroids, we have . In order to estimate the -th centroid, the algorithm picks a measurement vector at random in the set and initializes the estimation process with . In order to estimate as a fixed point of (4), the algorithm should recursively compute , see [25]. Here, we consider the following strategy for the matrix that is used in the recursion. In our algorithm, the first iteration is computed as . This corresponds to as given by Test n°1 in Section V-A, which comes from the fact that the centroid estimation is initialized with . From iteration , the recursion is computed as , which corresponds to as given by Test n°2 in Section V-A. This choice comes from the fact that is already a rough estimate of . Here, it would be better to consider the value of given by Test n°3 rather than Test n°2, but cannot be estimated at this stage of the algorithm. It is worth mentioning that this strategy (changing the matrix from iteration to iteration ) led to good clustering performance on all the simulations we considered, with various dimensions and , number of clusters , matrices , etc.

The recursion stops when , where (Test n°2) and is the stopping condition. The newly estimated centroid is given by , where represents the final iteration. To finish, the set of estimated centroids is updated as .

Once the centroid is estimated, the algorithm marks all the vectors that belong to cluster . For this, the algorithm applies Test n°2 of Section V-A to each , . Here again, we apply Test n°2 instead of Test n°3, because the value cannot be estimated at this stage of the algorithm. As a result, we assume that . All the observations that accept the null hypothesis under this test are grouped into the set . The set of marked vectors is then updated as . Note that the measurement vector , which serves for initialization, is also marked in order to avoid initializing again with the same vector. If , the algorithm estimates the next centroid . Otherwise, the algorithm moves to the fusion step.

### Vi-C Fusion

Once and, say, centroids have been estimated, the algorithm applies a so-called fusion step to identify the centroids that may have been estimated several times during the centroid estimation phase.
Indeed, in non-asymptotic situations, the estimated centroids issued from the centroid estimation phase are not guaranteed to be remote from each other and experiments show that the estimation phase tends to over-estimate the true number of centroids.

At this step, the algorithm first sets . It then applies Test n°4 defined in Section V-B to every pair of estimated centroids .
Since the cluster sizes and required by Test n°4 are unknown, we replace them by estimates and . These estimates are obtained by counting the number of vectors respectively assigned to clusters and during the marking operation.
When Test n°4 accepts hypothesis , the algorithm sets and removes from .
At the end, the number of estimated centroids is set as the cardinal of the final and the elements of are re-indexed in order to get .

### Vi-D Classification

Once centroids have been estimated, the algorithm moves to the classification step.
Denote by the set of measurement vectors assigned to cluster .
Each vector is assigned to the cluster whose centroid is the closest to , *i.e.*, , where (Test n°2, assuming that is very close to ).
Here, using this condition instead of an hypothesis test forces each measurement vector to be assigned to a cluster.

### Vi-E Empirical parameters

The described algorithm depends on some parameters and . In this section, we describe how to choose these parameters.

#### Vi-E1 Parameter

The false alarm probability participates to the definition of the weight function in Section IV. However, we observed in all our simulations that this parameter does not influence much the clustering performance. More precisely, we observed that any value of equal or lower than leads to the same clustering performance. The parameter

would be more useful in the case of outliers in the dataset, which is out of the scope of the paper.

#### Vi-E2 Parameter

The parameter defines the stopping criterion in the estimation of the centroids. As for the false alarm probability, does not influence much the decoding performance, although it can increase the number of iterations for the estimation when it is too small. In our simulations, we observed that can be set to any value between and without affecting the clustering performance.

At the end, our algorithm CENTREx shows three interesting characteristics compared to other existing clustering algorithms. First, it does not require prior knowledge of , since the centroids are estimated one after the other by looking for all the fixed points of the function . Second, it is not very sensitive to initialization, since the fusion step mitigates the effects of a bad initialization. Third, the empirical parameters and do not influence much the clustering performance. For these three reasons, the algorithm works in one run and does not need to be repeated several times in order to estimate and lower the initialization issues (like K-means), or to set up some empirical parameters (like DB-Scan). As a result, it appears as a suitable candidate for use in a fully decentralized setup.

## Vii Decentralized Clustering Algorithm

In this section, we consider a network of sensors where sensor observes measurement vectors, and . We denote by the set of measurement vectors observed by sensor . We assume that

Comments

There are no comments yet.