Chapter 3 K-means
Updated: 2021-12-25
3.1 Case 1
K-means clustering partitions n points into k clusters by assigning each point to the cluster whose mean (centroid) is nearest, then updating the centroids until the assignments stabilize. (Caution: do not confuse it with k-nearest neighbors, which is a supervised method.)
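To make the procedure concrete, here is a minimal sketch of the underlying algorithm (Lloyd's algorithm) in base R. The function name my_kmeans, the fixed iteration count, and the missing empty-cluster handling are simplifications of our own; in practice you would call the built-in kmeans().
# Illustrative sketch only; use kmeans() for real work
my_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centroids
  for (i in seq_len(iters)) {
    # Euclidean distance from every point to every centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- max.col(-d)  # index of the nearest centroid for each point
    # move each centroid to the mean of its assigned points
    for (j in seq_len(k))
      centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
  }
  list(cluster = cluster, centers = centers)
}
We use the demo data set "USArrests":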
data("USArrests")
df <- USArrests
head(df)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
# scaling the data
df <- scale(df)
head(df)
## Murder Assault UrbanPop
## Alabama 1.24256408 0.7828393 -0.5209066
## Alaska 0.50786248 1.1068225 -1.2117642
## Arizona 0.07163341 1.4788032 0.9989801
## Arkansas 0.23234938 0.2308680 -1.0735927
## California 0.27826823 1.2628144 1.7589234
## Colorado 0.02571456 0.3988593 0.8608085
## Rape
## Alabama -0.003416473
## Alaska 2.484202941
## Arizona 1.042878388
## Arkansas -0.184916602
## California 2.067820292
## Colorado 1.864967207
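After scaling, every column should have mean 0 and standard deviation 1, which you can verify with base R:
round(colMeans(df), 10)  # means are (numerically) zero
apply(df, 2, sd)         # standard deviations are one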
We use the factoextra package to create an attractive visualization of the clusters, and its fviz_nbclust() function to choose the number of clusters via the elbow (within-cluster sum of squares) method:
library(factoextra)
set.seed(123)
fviz_nbclust(df, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2)
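If you prefer not to depend on factoextra for this step, the same elbow curve can be computed by hand; this sketch uses only kmeans() and plot() from base R:
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
The bend near k = 4 suggests four clusters, so we fit the final model with centers = 4: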
set.seed(123)
(km.res <- kmeans(df, centers = 4, nstart = 25))
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop
## 1 1.4118898 0.8743346 -0.8145211
## 2 -0.9615407 -1.1066010 -0.9301069
## 3 -0.4894375 -0.3826001 0.5758298
## 4 0.6950701 1.0394414 0.7226370
## Rape
## 1 0.01927104
## 2 -0.96676331
## 3 -0.26165379
## 4 1.27693964
##
## Clustering vector:
## Alabama Alaska
## 1 4
## Arizona Arkansas
## 4 1
## California Colorado
## 4 4
## Connecticut Delaware
## 3 3
## Florida Georgia
## 4 1
## Hawaii Idaho
## 3 2
## Illinois Indiana
## 4 3
## Iowa Kansas
## 2 3
## Kentucky Louisiana
## 2 1
## Maine Maryland
## 2 4
## Massachusetts Michigan
## 3 4
## Minnesota Mississippi
## 2 1
## Missouri Montana
## 4 2
## Nebraska Nevada
## 2 4
## New Hampshire New Jersey
## 2 3
## New Mexico New York
## 4 4
## North Carolina North Dakota
## 1 2
## Ohio Oklahoma
## 3 3
## Oregon Pennsylvania
## 3 3
## Rhode Island South Carolina
## 3 1
## South Dakota Tennessee
## 2 1
## Texas Utah
## 4 3
## Vermont Virginia
## 2 3
## Washington West Virginia
## 3 2
## Wisconsin Wyoming
## 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers"
## [3] "totss" "withinss"
## [5] "tot.withinss" "betweenss"
## [7] "size" "iter"
## [9] "ifault"
There are nine available components, and you can access any of them with the $ operator.
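For example, the cluster sizes and the total within-cluster sum of squares:
km.res$size          # number of states in each cluster
km.res$tot.withinss  # total within-cluster sum of squares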
Compute the mean of each variable per cluster, using the original (unscaled) data:
aggregate(USArrests, by=list(cluster=km.res$cluster), mean)
## cluster Murder Assault UrbanPop
## 1 1 13.93750 243.62500 53.75000
## 2 2 3.60000 78.53846 52.07692
## 3 3 5.65625 138.87500 73.87500
## 4 4 10.81538 257.38462 76.00000
## Rape
## 1 21.41250
## 2 12.17692
## 3 18.78125
## 4 33.19231
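The same per-cluster summary can also be written with dplyr, assuming that package (version 1.0 or later, for across()) is installed:
library(dplyr)
USArrests %>%
  mutate(cluster = km.res$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))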
Bind the cluster assignment onto the original data to display the results:
dd <- cbind(USArrests, cluster = km.res$cluster)
head(dd)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
## cluster
## Alabama 1
## Alaska 4
## Arizona 4
## Arkansas 1
## California 4
## Colorado 4
Plot the clustering result using the factoextra package:
library(factoextra)
fviz_cluster(km.res, df, palette="Set2", ggtheme = theme_minimal())
#points(km.res$centers, col = 1:2, pch = 8, cex=2)
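With four variables, fviz_cluster() displays the observations on the first two principal components. A rough base-R equivalent (a sketch built on prcomp(); the overlay of projected centers mirrors the commented points() call above) is:
pca <- prcomp(df)
plot(pca$x[, 1:2], col = km.res$cluster, pch = 19)
# project the cluster centers onto the same components and overlay them
points(predict(pca, newdata = km.res$centers)[, 1:2], pch = 8, cex = 2)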
3.2 Case 2
We use the iris data for a second demo:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
We use the two features whose names contain "Length" (Sepal.Length and Petal.Length) to do a k-means clustering:
<- grep("Length", names(iris))
i <- iris[, i]
x <- kmeans(x, centers = 3, nstart = 10)
cl plot(x, col = cl$cluster)
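Since iris also records the true species, a cross-tabulation shows how well the two-feature clustering recovers them:
table(cl$cluster, iris$Species)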
3.3 Hierarchical clustering
d <- dist(iris[, 1:4])  # calculate the Euclidean distance
hcl <- hclust(d)
hcl
##
## Call:
## hclust(d = d)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 150
plot(hcl)
Usually, you will need to cut the tree with cutree() to define the clusters.
(cutree(hcl, h = 1.5))  # cut at a certain height
## [1] 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2
## [20] 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 1 1 2 1
## [39] 1 1 1 1 1 2 2 1 2 1 2 1 3 3 3 4 3 4 3
## [58] 5 3 4 5 4 4 3 4 3 4 4 3 4 6 4 6 3 3 3
## [77] 3 3 3 4 4 4 4 6 4 3 3 3 4 4 4 3 4 5 4
## [96] 4 4 3 5 4 7 6 8 7 7 8 4 8 7 8 7 6 7 6
## [115] 6 7 7 8 8 3 7 6 8 6 7 8 6 6 7 8 8 8 7
## [134] 6 6 8 7 7 6 7 7 7 6 7 7 7 6 7 7 6
cutree(hcl, k = 2)  # cut into a given number of clusters
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [20] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 2
## [58] 1 2 1 1 1 1 2 1 2 1 1 2 1 2 1 2 2 2 2
## [77] 2 2 2 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1
## [96] 1 1 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [115] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [134] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
plot(hcl)
abline(h = 3.9, col="blue")
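You can also outline the resulting clusters directly on the dendrogram with rect.hclust():
plot(hcl)
rect.hclust(hcl, k = 3, border = "blue")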
You can check that the two ways of cutting agree using identical():
identical(cutree(hcl, k=3), cutree(hcl, h=3.9))
## [1] TRUE
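As in the k-means demo, the result of the cut can be compared against the known species labels:
table(cutree(hcl, k = 3), iris$Species)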