Illustration from Project Gutenberg

*The goal of cluster analysis is to group the observations in the data into *clusters* such that every datum in a cluster is more similar to other datums in the same cluster than it is to datums in other clusters. This is an analysis method of choice when annotated training data is not readily available. In this article, based on chapter 8 of Practical Data Science with R, the authors discuss one approach to evaluating the clusters that are discovered by a chosen clustering method. *

An important question when evaluating clusters is whether a given cluster is “real”-does the cluster represent actual structure in the data, or is it an artifact of the clustering algorithm? This is especially important with clustering algorithms like k-means, where the user has to specify the number of clusters *a priori*. It’s been our experience that clustering algorithms will often produce several clusters that represent actual structure or relationships in the data, and then one or two clusters that are buckets that represent “other” or “miscellaneous.” Clusters of “other” tend to be made up of data points that have no real relationship to each other; they just don’t fit anywhere else.

One way to assess whether a cluster represents true structure is to see if the cluster holds up under plausible variations in the dataset. The `fpc `

package has a function called `clusterboot()`

that uses bootstrap resampling to evaluate how stable a given cluster is. (For a full description of the algorithm, see Christian Henning, “Cluster-wise assessment of cluster stability,” Research Report 271, Dept. of Statistical Science, University College London, December 2006). `clusterboot()`

is an integrated function that both performs the clustering and evaluates the final produced clusters. It has interfaces to a number of R clustering algorithms, including both `hclust `

and `kmeans`

.

`clusterboot`

‘s algorithm uses the *Jaccard coefficient*, a similarity measure between sets. The Jaccard similarity between two sets A and B is the ratio of the number of elements in the intersection of A and B over the number of elements in the union of A and B. The basic general strategy is as follows:

- Cluster the data as usual.
- Draw a new dataset (of the same size as the original) by resampling the original dataset with replacement (meaning that some of the data points may show up more than once, and others not at all). Cluster the new dataset.
- For every cluster in the original clustering, find the most similar cluster in the new clustering (the one that gives the maximum Jaccard coefficient) and record that value. If this maximum Jaccard coefficient is less than 0.5, the original cluster is considered to be
*dissolved*-it didn’t show up in the new clustering. A cluster that’s dissolved too often is probably not a “real” cluster. - Repeat steps 2-3 several times.

The cluster stability of each cluster in the original clustering is the mean value of its Jaccard coefficient over all the bootstrap iterations. As a rule of thumb, clusters with a stability value less than 0.6 should be considered unstable. Values between 0.6 and 0.75 indicate that the cluster is measuring a pattern in the data, but there isn’t high certainty about which points should be clustered together. Clusters with stability values above about 0.85 can be considered highly stable (they’re likely to be real clusters).

Different clustering algorithms can give different stability values, even when the algorithms produce highly similar clusterings, so `clusterboot()`

is also measuring how stable the clustering algorithm is.

**The protein dataset**

To demonstrate `clusterboot()`

, we’ll use a small dataset from 1973 on protein consumption from nine different food groups in 25 countries in Europe. The original dataset can be found here. A tab-separated text file with the data can be found in this directory. The data file is called `protein.txt`

; additional information can be found in the file `protein_README.txt.`

The goal is to group the countries based on patterns in their protein consumption. The dataset is loaded into R as a data frame called `protein`

, as shown in the next listing.

protein <- read.table("protein.txt", sep="t", header=TRUE) summary(protein) # Country RedMeat WhiteMeat Eggs # Albania : 1 Min. : 4.400 Min. : 1.400 Min. :0.500 # Austria : 1 1st Qu.: 7.800 1st Qu.: 4.900 1st Qu.:2.700 # Belgium : 1 Median : 9.500 Median : 7.800 Median :2.900 # Bulgaria : 1 Mean : 9.828 Mean : 7.896 Mean :2.936 # Czechoslovakia: 1 3rd Qu.:10.600 3rd Qu.:10.800 3rd Qu.:3.700 # Denmark : 1 Max. :18.000 Max. :14.000 Max. :4.700 # (Other) :19 # Milk Fish Cereals Starch # Min. : 4.90 Min. : 0.200 Min. :18.60 Min. :0.600 # 1st Qu.:11.10 1st Qu.: 2.100 1st Qu.:24.30 1st Qu.:3.100 # Median :17.60 Median : 3.400 Median :28.00 Median :4.700 # Mean :17.11 Mean : 4.284 Mean :32.25 Mean :4.276 # 3rd Qu.:23.30 3rd Qu.: 5.800 3rd Qu.:40.10 3rd Qu.:5.700 # Max. :33.70 Max. :14.200 Max. :56.70 Max. :6.500 # # Nuts Fr.Veg # Min. :0.700 Min. :1.400 # 1st Qu.:1.500 1st Qu.:2.900 # Median :2.400 Median :3.800 # Mean :3.072 Mean :4.136 # 3rd Qu.:4.700 3rd Qu.:4.900 # Max. :7.800 Max. :7.900 # Use all the columns except the first # (Country). vars.to.use <- colnames(protein)[-1] # Scale the data columns to be zero mean # and unit variance. # The output of scale() is a matricx. pmatrix <- scale(protein[,vars.to.use]) # optionally, store the centers and # standard deviations of the original data, # so you can "unscale" it later. pcenter <- attr(pmatrix, "scaled:center") pscale <- attr(pmatrix, "scaled:scale")

Before running `clusterboot()`

we’ll cluster the data using a hierarchical clustering algorithm (Ward’s method):

# Create the distance matrix. d <- dist(pmatrix, method="euclidean") # Do the clustering. pfit <- hclust(d, method="ward") # Plot the dendrogram. plot(pfit, labels=protein$Country)

The dendrogram suggests five clusters, as shown in the figure below. You can draw the rectangles on the dendrogram using the command `rect.hclust(pfit, k=5)`

.

Let’s extract and print the clusters:

# A convenience function for printing out the # countries in each cluster, along with the values # for red meat, fish, and fruit/vegetable # consumption. print_clusters <- function(labels, k) { for(i in 1:k) { print(paste("cluster", i)) print(protein[labels==i,c("Country","RedMeat","Fish","Fr.Veg")]) } } # get the cluster labels groups <- cutree(pfit, k=5) # --- results -- > print_clusters(groups, 5) [1] "cluster 1" Country RedMeat Fish Fr.Veg 1 Albania 10.1 0.2 1.7 4 Bulgaria 7.8 1.2 4.2 18 Romania 6.2 1.0 2.8 25 Yugoslavia 4.4 0.6 3.2 [1] "cluster 2" Country RedMeat Fish Fr.Veg 2 Austria 8.9 2.1 4.3 3 Belgium 13.5 4.5 4.0 9 France 18.0 5.7 6.5 12 Ireland 13.9 2.2 2.9 14 Netherlands 9.5 2.5 3.7 21 Switzerland 13.1 2.3 4.9 22 UK 17.4 4.3 3.3 24 W Germany 11.4 3.4 3.8 [1] "cluster 3" Country RedMeat Fish Fr.Veg 5 Czechoslovakia 9.7 2.0 4.0 7 E Germany 8.4 5.4 3.6 11 Hungary 5.3 0.3 4.2 16 Poland 6.9 3.0 6.6 23 USSR 9.3 3.0 2.9 [1] "cluster 4" Country RedMeat Fish Fr.Veg 6 Denmark 10.6 9.9 2.4 8 Finland 9.5 5.8 1.4 15 Norway 9.4 9.7 2.7 20 Sweden 9.9 7.5 2.0 [1] "cluster 5" Country RedMeat Fish Fr.Veg 10 Greece 10.2 5.9 6.5 13 Italy 9.0 3.4 6.7 17 Portugal 6.2 14.2 7.9 19 Spain 7.1 7.0 7.2

There’s a certain logic to these clusters: the countries in each cluster tend to be in the same geographical region. It makes sense that countries in the same region would have similar dietary habits. You can also see that

- Cluster 2 is made of countries with higher-than-average red meat consumption.
- Cluster 4 contains countries with higher-than-average fish consumption but low produce consumption.
- Cluster 5 contains countries with high fish and produce consumption.

Let’s run `clusterboot()`

on the protein data, using hierarchical clustering with five clusters.

# load the fpc package library(fpc) # set the desired number of clusters kbest.p<-5 # Run clusterboot() with hclust # ('clustermethod=hclustCBI') using Ward's method # ('method="ward"') and kbest.p clusters # ('k=kbest.p'). Return the results in an object # called cboot.hclust. cboot.hclust <- clusterboot(pmatrix,clustermethod=hclustCBI, method="ward", k=kbest.p) # The results of the clustering are in # cboot.hclust$result. The output of the hclust() # function is in cboot.hclust$result$result. # # cboot.hclust$result$partition returns a # vector of clusterlabels. groups<-cboot.hclust$result$partition # -- results -- > print_clusters(groups, kbest.p) [1] "cluster 1" Country RedMeat Fish Fr.Veg 1 Albania 10.1 0.2 1.7 4 Bulgaria 7.8 1.2 4.2 18 Romania 6.2 1.0 2.8 25 Yugoslavia 4.4 0.6 3.2 [1] "cluster 2" Country RedMeat Fish Fr.Veg 2 Austria 8.9 2.1 4.3 3 Belgium 13.5 4.5 4.0 9 France 18.0 5.7 6.5 12 Ireland 13.9 2.2 2.9 14 Netherlands 9.5 2.5 3.7 21 Switzerland 13.1 2.3 4.9 22 UK 17.4 4.3 3.3 24 W Germany 11.4 3.4 3.8 [1] "cluster 3" Country RedMeat Fish Fr.Veg 5 Czechoslovakia 9.7 2.0 4.0 7 E Germany 8.4 5.4 3.6 11 Hungary 5.3 0.3 4.2 16 Poland 6.9 3.0 6.6 23 USSR 9.3 3.0 2.9 [1] "cluster 4" Country RedMeat Fish Fr.Veg 6 Denmark 10.6 9.9 2.4 8 Finland 9.5 5.8 1.4 15 Norway 9.4 9.7 2.7 20 Sweden 9.9 7.5 2.0 [1] "cluster 5" Country RedMeat Fish Fr.Veg 10 Greece 10.2 5.9 6.5 13 Italy 9.0 3.4 6.7 17 Portugal 6.2 14.2 7.9 19 Spain 7.1 7.0 7.2 # The vector of cluster stabilities. # Values close to 1 indicate stable clusters > cboot.hclust$bootmean [1] 0.7905000 0.7990913 0.6173056 0.9312857 0.7560000 # The count of how many times each cluster was # dissolved. By default clusterboot() runs 100 # bootstrap iterations. # Clusters that are dissolved often are unstable. > cboot.hclust$bootbrd [1] 25 11 47 8 35

The above results show that the cluster of countries with high fish consumption (cluster 4) is highly stable (cluster stability = 0.93). Clusters 1 and 2 are also quite stable; cluster 5 (cluster stability 0.76) less so. Cluster 3 (cluster stability 0.62) has the characteristics of what we’ve been calling the “other” cluster. Note that `clusterboot()`

has a random component, so the exact stability values and number of times each cluster is dissolved will vary from run to run.

Based on these results, we can say that the countries in cluster 4 have highly similar eating habits, distinct from those of the other countries (high fish and red meat consumption, with a relatively low amount of fruits and vegetables); we can also say that the countries in clusters 1 and 2 represent distinct eating patterns as well. The countries in cluster 3, on the other hand, show eating patterns that are different from those of the countries in other clusters, but aren’t as strongly similar to each other.

The `clusterboot()`

algorithm assumes that you have the number of clusters, *k*. Obviously, determining *k* will be harder for datasets that are larger than our protein example. There are ways of estimating *k*, but they are beyond the scope of this article. Once you have an idea of the number of clusters, however, `clusterboot()`

is a useful tool for evaluating the strength of the patterns that you have discovered.

For more about clustering, please refer to our free sample chapter 8 of *Practical Data Science with R*.

Categories: data science Practical Data Science Statistics Tutorials

### Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

A very important topic that you touched! Would it make sense to not only bootstrap the samples within one cluster, but to resample by allocating samples into other clusters? This way, would the Jaccard index indicate the stability of the whole tree and not only clusters?

Cheers,

Andrej

Thanks for your comment, Andrej. The entire data set is bootstrap resampled, and that resample is reclustered, with the same number of clusters k as the original clustering. Then each of the original clusters is matched to the “most similar” cluster in the new clustering. So while the Jaccard index only measures the stability of a single cluster (from the original clustering), the set of Jaccard indices tells us about the stability of the entire clustering procedure on that data set.

Hi Nina,

I think you would be interested having a look at the dendextend R package:

https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html

Specifically look at the visualization for pvclust:

https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#pvclust

Which might be used also for what you have been doing.

The package also has a short 2 page (open access) paper in bioinformatics:

http://bioinformatics.oxfordjournals.org/content/early/2015/08/18/bioinformatics.btv428

With regards,

Tal

Thanks, Tal! That looks like a really useful package (Looks like it wasn’t around when we wrote Practical Data Science with R). We will definitely check it out.

Thanks Nina,

Take care :)

Tal

Very clear presentation, thanks. May I ask is there a way to find which member of a group is the most stable? I mean a member that can represent the whole group? Or ranking of members of a group ?

Thanks for your question!

For a given cluster O in the original clustering, the Jaccard coefficients tell you for each re-clustering which new cluster is most similar to O. Right now,

`clusterboot`

only uses the value of that highest Jaccard coefficient to make the match (call the best new cluster N). But you could also look at how many times each member of O appears in N. This would give you a ranking of the “representativeness” of each member of O — if a certain member of O always appears in N, then you could consider it representative of O.