K-Means Analysis
This project uses the Wholesale customers Data Set from https://archive.ics.uci.edu/ml/datasets/Wholesale+customers. The data has 440 observations and 8 variables. Channel and Region are stored as integers but are categorical in nature, with 2 channels and 3 regions; we suspect they may need to be taken out of the analysis, but we leave them in for now. A general observation about the variables: there are outliers in the 6 spending variables (Fresh, Milk, Grocery, Frozen, Detergents_Paper, Delicassen). A plot of each variable is shown below. In addition to the outliers, the data is dense at the bottom of each variable’s graph. “Fresh” seems less dense than the rest, and “Delicassen” has the thickest band of points at the bottom. Our initial expectation is that the clusters will sit somewhere at the bottom of the graph.
library(stats)
library(fpc)
library(factoextra)
library(cluster)
library(NbClust)
library(ggplot2)
Wholesale_customers_data = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv", row.names=NULL)
df <- Wholesale_customers_data
ls(df)
## [1] "Channel" "Delicassen" "Detergents_Paper" "Fresh"
## [5] "Frozen" "Grocery" "Milk" "Region"
str(df)
## 'data.frame': 440 obs. of 8 variables:
## $ Channel : int 2 2 2 1 2 2 2 2 1 2 ...
## $ Region : int 3 3 3 3 3 3 3 3 3 3 ...
## $ Fresh : int 12669 7057 6353 13265 22615 9413 12126 7579 5963 6006 ...
## $ Milk : int 9656 9810 8808 1196 5410 8259 3199 4956 3648 11093 ...
## $ Grocery : int 7561 9568 7684 4221 7198 5126 6975 9426 6192 18881 ...
## $ Frozen : int 214 1762 2405 6404 3915 666 480 1669 425 1159 ...
## $ Detergents_Paper: int 2674 3293 3516 507 1777 1795 3140 3321 1716 7425 ...
## $ Delicassen : int 1338 1776 7844 1788 5185 1451 545 2566 750 2098 ...
summary(df)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
plot(df$Fresh)
plot(df$Milk)
plot(df$Grocery)
plot(df$Frozen)
plot(df$Detergents_Paper)
plot(df$Delicassen)
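As an optional sketch not in the original analysis, the outliers in all six spending variables can be seen at once with a boxplot:
# Sketch: one box per spending variable; points beyond the whiskers are the outliers noted above
# boxplot(df[, 3:8], las = 2, main = "Spending variables")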
table(df$Channel)
##
## 1 2
## 298 142
table(df$Region)
##
## 1 2 3
## 77 47 316
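Since Channel and Region are categorical, one optional sketch (not applied in this analysis) is to recode them as factors; the level labels follow the UCI variable description and are our assumption:
# Sketch only: label the categorical codes per the UCI documentation
# df$Channel <- factor(df$Channel, labels = c("Horeca", "Retail"))
# df$Region <- factor(df$Region, labels = c("Lisbon", "Oporto", "Other"))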
groc <- df
The data needs to be normalized or scaled, since the raw values range from 3 to 112,151. It was assumed early on that a min-max normalization function could be used; however, the tutorials we followed all used the scale() function, which produces z-scores: each variable is centered at 0 with unit standard deviation, so anything below the column mean comes out negative, and the values here run from about -2 up to 16.45. Min-max normalization would instead place everything in 0 to 1, which might compress the points even further.
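As a quick illustration (not part of the original analysis) of why scale() produces negative values while min-max normalization does not, using the Fresh min, quartiles, and max reported in the summary above:
# Illustration: z-scores are centered at 0, so values below the mean come out negative
# x <- c(3, 3128, 8504, 16934, 112151)
# round(scale(x)[, 1], 2)
# round((x - min(x)) / (max(x) - min(x)), 4) # min-max keeps everything in [0, 1]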
# normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
#df_z <- as.data.frame(lapply(df[c("Fresh", "Milk","Grocery","Frozen", "Detergents_Paper", "Delicassen")], normalize))
df_z <- as.data.frame(lapply(groc, scale))
summary(df_z)
## Channel Region Fresh Milk
## Min. :-0.6895 Min. :-1.9931 Min. :-0.9486 Min. :-0.7779
## 1st Qu.:-0.6895 1st Qu.:-0.7015 1st Qu.:-0.7015 1st Qu.:-0.5776
## Median :-0.6895 Median : 0.5900 Median :-0.2764 Median :-0.2939
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.4470 3rd Qu.: 0.5900 3rd Qu.: 0.3901 3rd Qu.: 0.1889
## Max. : 1.4470 Max. : 0.5900 Max. : 7.9187 Max. : 9.1732
## Grocery Frozen Detergents_Paper Delicassen
## Min. :-0.8364 Min. :-0.62763 Min. :-0.6037 Min. :-0.5396
## 1st Qu.:-0.6101 1st Qu.:-0.47988 1st Qu.:-0.5505 1st Qu.:-0.3960
## Median :-0.3363 Median :-0.31844 Median :-0.4331 Median :-0.1984
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2846 3rd Qu.: 0.09935 3rd Qu.: 0.2182 3rd Qu.: 0.1047
## Max. : 8.9264 Max. :11.90545 Max. : 7.9586 Max. :16.4597
plot(df_z$Channel, df_z$Fresh, main="Channel and Fresh")
plot(df_z$Region, df_z$Fresh, main="Region and Fresh")
plot(df_z$Fresh, df_z$Grocery, main="Fresh and Grocery")
plot(df_z$Frozen, df_z$Grocery, main="Frozen and Grocery")
plot(df_z$Fresh, df_z$Delicassen, main="Fresh and Delicassen")
The K-means run starts here with a base of 2 clusters and produces a between_SS / total_SS ratio of 26.2%. We then look at the objects in the model and plot. In the first plot the two cluster centers are close together in the bottom right corner, as speculated. For comparison, 5 and 8 clusters are then presented: the 5-cluster model gives a ratio of 56.2% and the 8-cluster model 69.7%, which during testing was as high as 71%. This ratio necessarily rises as k increases, so the 26.2% at k = 2 being the lowest does not by itself make it the better clustering; we keep k = 2 as the working choice for now.
set.seed(5678)
km.out <- kmeans(df_z, 2, nstart = 20)
km.out
## K-means clustering with 2 clusters of sizes 305, 135
##
## Cluster means:
## Channel Region Fresh Milk Grocery Frozen
## 1 -0.6334724 -0.04941902 0.1245579 -0.3357379 -0.4216872 0.1225824
## 2 1.4311785 0.11165038 -0.2814087 0.7585190 0.9527008 -0.2769455
## Detergents_Paper Delicassen
## 1 -0.4368095 -0.09157253
## 2 0.9868659 0.20688608
##
## Clustering vector:
## [1] 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 1 2 1 2 1 2 1 1 2 2 2 1 1 2 1 1 1 1 1 1 2 1
## [38] 2 2 1 1 1 2 2 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1
## [75] 2 1 1 2 1 1 1 2 2 1 2 2 2 1 1 1 1 1 2 1 2 1 2 1 1 1 2 2 2 1 1 1 2 2 2 2 1
## [112] 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [149] 1 1 1 1 1 1 1 2 2 1 2 2 2 1 1 2 2 2 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 1 2 1
## [186] 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2 1 1 2 1 2 1 2 1 1 1
## [223] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 2
## [297] 1 2 2 1 2 2 2 2 2 2 2 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## [334] 2 1 2 1 1 1 1 2 2 1 2 1 1 2 2 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1
## [371] 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 2 2 1 1 1 1 1 1 2 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 1315.784 1277.692
## (between_SS / total_SS = 26.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Note: the x-axis here is just the row index; the red circles overlay the centers'
# Fresh (x) and Frozen (y) values, so their placement is only a rough visual guide.
plot(df_z$Fresh, col=(km.out$cluster+3), main="K-means clustering results with K=2", xlab="", ylab="", pch=1, cex=1)
points(km.out$centers[,c("Fresh", "Frozen")], pch=1, cex=3, col="red", lwd=2)
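A scatter of two actual variables, with the centers overlaid in matching coordinates, may be easier to read; a sketch not in the original output:
# Sketch: Fresh vs Frozen colored by cluster, centers in the same coordinate system
# plot(df_z$Fresh, df_z$Frozen, col = km.out$cluster + 3, pch = 1)
# points(km.out$centers[, c("Fresh", "Frozen")], pch = 4, cex = 3, col = "red", lwd = 2)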
set.seed(5679)
km.out <- kmeans(df_z, 5, nstart = 20)
km.out
## K-means clustering with 5 clusters of sizes 91, 210, 10, 126, 3
##
## Cluster means:
## Channel Region Fresh Milk Grocery Frozen
## 1 -0.5721212 -1.59567800 0.01452064 -0.3443661 -0.4020087 0.079577121
## 2 -0.6793383 0.58999669 0.11253552 -0.3555734 -0.4424744 0.073160058
## 3 1.4470045 -0.05577083 0.31347349 3.9174467 4.2707490 -0.003570131
## 4 1.4470045 0.16973529 -0.31436434 0.4519519 0.6653892 -0.350667521
## 5 -0.6895122 0.15948501 3.84044484 3.2957757 0.9852919 7.204892918
## Detergents_Paper Delicassen
## 1 -0.4239285 -0.13295117
## 2 -0.4432338 -0.09138933
## 3 4.6129149 0.50279301
## 4 0.6824271 0.04653468
## 5 -0.1527927 6.79967230
##
## Clustering vector:
## [1] 4 4 4 2 4 4 4 4 2 4 4 4 4 4 4 2 4 2 4 2 4 2 2 4 4 4 2 2 4 2 2 2 2 2 2 4 2
## [38] 4 4 2 2 2 4 4 4 4 4 3 4 4 2 2 4 4 2 2 3 4 2 2 4 3 4 4 2 3 2 4 2 2 2 2 2 4
## [75] 4 2 2 4 2 2 2 4 4 2 4 3 3 2 2 2 2 2 3 2 4 2 4 2 2 2 4 4 4 2 2 2 4 4 4 4 2
## [112] 4 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2
## [149] 2 2 2 2 2 2 2 4 4 2 4 4 4 2 2 4 4 4 4 2 2 2 4 4 2 4 2 4 2 2 2 2 2 5 2 5 2
## [186] 2 2 2 4 4 2 2 2 4 2 2 1 4 1 1 4 4 1 1 1 4 1 1 1 4 1 3 1 1 4 1 4 1 4 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 3 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 4 1 4 1 4 1 1 1 1 2 2 2 2 2 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 4 1 4
## [297] 1 4 4 1 4 4 4 4 4 4 4 1 1 4 1 1 4 1 1 4 1 1 1 4 1 1 1 1 1 5 1 1 1 1 1 4 1
## [334] 3 1 4 1 1 1 1 4 4 2 4 2 2 4 4 2 4 2 4 2 4 2 2 2 4 2 2 2 2 2 2 2 4 2 2 2 2
## [371] 2 2 2 4 2 2 4 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2
## [408] 4 4 2 2 2 2 2 2 4 4 2 4 2 2 4 2 4 4 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2
##
## Within cluster sum of squares by cluster:
## [1] 221.0813 541.8922 160.2905 398.3096 215.6517
## (between_SS / total_SS = 56.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
plot(df_z$Fresh, col=(km.out$cluster+1), main="K-means clustering results with K=5", xlab="", ylab="", pch=1, cex=1)
points(km.out$centers[,c("Fresh", "Frozen")], pch=1, cex=3, col="red", lwd=2)
set.seed(5671)
km.out <- kmeans(df_z, 8, nstart = 20)
km.out
## K-means clustering with 8 clusters of sizes 36, 156, 5, 1, 5, 85, 59, 93
##
## Cluster means:
## Channel Region Fresh Milk Grocery Frozen
## 1 1.4470045 -0.5221585 -0.4928922 1.3582881 1.7487863 -0.28074063
## 2 -0.6895122 0.5899967 -0.3416985 -0.3904737 -0.4688299 -0.21071338
## 3 1.4470045 0.3316897 1.0755395 5.1033075 5.6319063 -0.08979632
## 4 -0.6895122 0.5899967 1.9645810 5.1696185 1.2857533 6.89275382
## 5 -0.6895122 0.3316897 3.6134035 0.7451291 0.2123001 5.43102828
## 6 -0.5638348 -1.6132101 -0.1569262 -0.3436053 -0.3976095 0.01054604
## 7 -0.5808758 0.3492020 1.4339190 -0.2537941 -0.3715451 0.69869430
## 8 1.4470045 0.4233470 -0.2755129 0.2342368 0.3805628 -0.35204667
## Detergents_Paper Delicassen
## 1 1.8447981 0.34264536
## 2 -0.4329138 -0.18642642
## 3 5.6823687 0.41981740
## 4 -0.5542311 16.45971129
## 5 -0.2457485 0.89703352
## 6 -0.4107407 -0.15629362
## 7 -0.4685727 0.17667111
## 8 0.3984054 -0.03693887
##
## Clustering vector:
## [1] 8 8 8 2 8 8 8 8 2 8 8 8 8 8 8 2 8 2 8 2 8 2 7 1 8 8 2 2 1 7 2 2 2 7 2 8 7
## [38] 8 8 7 7 2 8 1 8 1 8 3 8 1 2 2 7 8 7 2 1 8 2 2 8 3 8 8 2 1 2 8 2 2 7 7 2 8
## [75] 8 7 2 1 2 2 2 8 8 2 8 3 3 7 2 7 2 2 1 5 8 2 8 2 2 2 8 8 8 5 2 2 8 8 8 8 2
## [112] 8 7 2 2 2 2 2 7 2 2 2 2 8 7 5 7 8 2 7 2 2 2 2 2 2 2 2 2 2 2 7 7 2 2 1 2 2
## [149] 2 7 2 2 2 2 2 1 8 2 8 8 8 2 2 1 8 8 8 2 2 2 8 1 2 8 2 8 7 2 2 2 2 5 2 4 2
## [186] 2 2 2 8 8 7 2 2 8 2 7 7 6 6 6 1 1 6 6 6 1 6 6 6 1 6 1 6 6 8 6 1 6 1 6 6 6
## [223] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 6 6 6 6 6 1 6 6 6 6 6 1 6 6 6 6 6 6 7
## [260] 7 6 6 6 6 1 6 1 6 1 6 6 6 6 7 2 2 7 7 2 8 2 8 7 7 7 7 2 7 2 7 2 2 2 8 6 8
## [297] 6 8 8 6 8 1 8 8 1 8 1 6 6 1 6 7 1 6 6 8 6 6 6 1 6 6 6 6 7 5 6 6 6 6 6 1 6
## [334] 3 7 8 6 6 6 6 8 8 2 1 2 2 8 8 2 1 2 1 2 8 2 2 7 8 2 2 2 2 2 2 2 8 2 2 2 2
## [371] 7 7 2 8 2 2 8 7 2 8 7 7 7 2 2 2 2 2 2 2 2 2 2 7 2 2 8 2 2 2 2 7 7 7 2 2 7
## [408] 8 8 2 2 2 2 7 2 8 8 2 8 2 2 8 7 8 8 7 2 7 2 2 2 7 2 2 2 7 7 1 2 2
##
## Within cluster sum of squares by cluster:
## [1] 190.61080 129.76038 92.86852 0.00000 119.17144 165.34225 229.77076
## [8] 137.67863
## (between_SS / total_SS = 69.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
set.seed(4567)
km.out <- kmeans(df_z, 8, nstart = 50)
km.out
## K-means clustering with 8 clusters of sizes 5, 59, 156, 36, 85, 1, 5, 93
##
## Cluster means:
## Channel Region Fresh Milk Grocery Frozen
## 1 1.4470045 0.3316897 1.0755395 5.1033075 5.6319063 -0.08979632
## 2 -0.5808758 0.3492020 1.4339190 -0.2537941 -0.3715451 0.69869430
## 3 -0.6895122 0.5899967 -0.3416985 -0.3904737 -0.4688299 -0.21071338
## 4 1.4470045 -0.5221585 -0.4928922 1.3582881 1.7487863 -0.28074063
## 5 -0.5638348 -1.6132101 -0.1569262 -0.3436053 -0.3976095 0.01054604
## 6 -0.6895122 0.5899967 1.9645810 5.1696185 1.2857533 6.89275382
## 7 -0.6895122 0.3316897 3.6134035 0.7451291 0.2123001 5.43102828
## 8 1.4470045 0.4233470 -0.2755129 0.2342368 0.3805628 -0.35204667
## Detergents_Paper Delicassen
## 1 5.6823687 0.41981740
## 2 -0.4685727 0.17667111
## 3 -0.4329138 -0.18642642
## 4 1.8447981 0.34264536
## 5 -0.4107407 -0.15629362
## 6 -0.5542311 16.45971129
## 7 -0.2457485 0.89703352
## 8 0.3984054 -0.03693887
##
## Clustering vector:
## [1] 8 8 8 3 8 8 8 8 3 8 8 8 8 8 8 3 8 3 8 3 8 3 2 4 8 8 3 3 4 2 3 3 3 2 3 8 2
## [38] 8 8 2 2 3 8 4 8 4 8 1 8 4 3 3 2 8 2 3 4 8 3 3 8 1 8 8 3 4 3 8 3 3 2 2 3 8
## [75] 8 2 3 4 3 3 3 8 8 3 8 1 1 2 3 2 3 3 4 7 8 3 8 3 3 3 8 8 8 7 3 3 8 8 8 8 3
## [112] 8 2 3 3 3 3 3 2 3 3 3 3 8 2 7 2 8 3 2 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 4 3 3
## [149] 3 2 3 3 3 3 3 4 8 3 8 8 8 3 3 4 8 8 8 3 3 3 8 4 3 8 3 8 2 3 3 3 3 7 3 6 3
## [186] 3 3 3 8 8 2 3 3 8 3 2 2 5 5 5 4 4 5 5 5 4 5 5 5 4 5 4 5 5 8 5 4 5 4 5 5 5
## [223] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 2 5 5 5 5 5 4 5 5 5 5 5 4 5 5 5 5 5 5 2
## [260] 2 5 5 5 5 4 5 4 5 4 5 5 5 5 2 3 3 2 2 3 8 3 8 2 2 2 2 3 2 3 2 3 3 3 8 5 8
## [297] 5 8 8 5 8 4 8 8 4 8 4 5 5 4 5 2 4 5 5 8 5 5 5 4 5 5 5 5 2 7 5 5 5 5 5 4 5
## [334] 1 2 8 5 5 5 5 8 8 3 4 3 3 8 8 3 4 3 4 3 8 3 3 2 8 3 3 3 3 3 3 3 8 3 3 3 3
## [371] 2 2 3 8 3 3 8 2 3 8 2 2 2 3 3 3 3 3 3 3 3 3 3 2 3 3 8 3 3 3 3 2 2 2 3 3 2
## [408] 8 8 3 3 3 3 2 3 8 8 3 8 3 3 8 2 8 8 2 3 2 3 3 3 2 3 3 3 2 2 4 3 3
##
## Within cluster sum of squares by cluster:
## [1] 92.86852 229.77076 129.76038 190.61080 165.34225 0.00000 119.17144
## [8] 137.67863
## (between_SS / total_SS = 69.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Here we wanted to see what would happen if “Channel” and “Region” are dropped and just the remaining 6 spending variables are used. With 2 clusters the between_SS / total_SS ratio comes out at 26.0%, with the cluster centers still in the bottom left corner. Of the other cluster counts, k = 6 gives a 65.1% ratio and k = 8 gives 72.1%. Thus 2 clusters may still be the best option. However, everything is compressed into the bottom left corner, which makes it hard to see what is going on visually in a scatter plot; maybe this is not the best data set for an introductory exercise. We will use one more method to determine the optimal cluster number below.
groc <- df[3:8]
df_t <- as.data.frame(lapply(groc, scale))
set.seed(4568)
km.out1 <- kmeans(df_t, 2, nstart = 20)
km.out1
## K-means clustering with 2 clusters of sizes 399, 41
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 -0.00542930 -0.2122882 -0.2302493 -0.03310806 -0.2320799 -0.0826124
## 2 0.05283636 2.0659269 2.2407190 0.32219794 2.2585338 0.8039597
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1
## [75] 1 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [149] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 2 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 2 1 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 2 1
## [334] 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
##
## Within cluster sum of squares by cluster:
## [1] 982.9619 966.3860
## (between_SS / total_SS = 26.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# The k = 2 model was fit on df_t; its Fresh column has the same scaled values as df_z$Fresh
plot(df_t$Fresh, col=(km.out1$cluster+3), main="K-means clustering results with K=2", xlab="", ylab="", pch=1, cex=1)
points(km.out1$centers[,c("Fresh", "Frozen")], pch=1, cex=3, col="red", lwd=2)
# Fit the k = 6 model before plotting so the colors reflect the 6-cluster solution
set.seed(4562)
km.out1 <- kmeans(df_t, 6, nstart = 20)
plot(df_z$Fresh, col=(km.out1$cluster+2), main="K-means clustering results with K=6", xlab="", ylab="", pch=1, cex=1)
km.out1
## K-means clustering with 6 clusters of sizes 242, 88, 10, 94, 1, 5
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 -0.3472255 -0.3886643 -0.4345878 -0.182611380 -0.3885364 -0.21489822
## 2 1.2297852 -0.2212199 -0.2997703 0.471366329 -0.4171626 0.19696794
## 3 0.3134735 3.9174467 4.2707490 -0.003570131 4.6129149 0.50279301
## 4 -0.5038176 0.6963232 0.9201628 -0.332983798 0.9190419 0.09254661
## 5 1.9645810 5.1696185 1.2857533 6.892753825 -0.5542311 16.45971129
## 6 3.6134035 0.7451291 0.2123001 5.431028285 -0.2457485 0.89703352
##
## Clustering vector:
## [1] 1 4 4 1 2 1 1 1 1 4 4 1 2 4 4 1 4 1 2 1 1 1 2 4 2 1 1 1 4 2 2 1 1 2 1 1 2
## [38] 4 4 2 2 2 4 4 4 4 4 3 4 4 1 1 2 4 2 1 3 4 1 1 1 3 1 4 1 3 1 4 1 1 2 2 1 2
## [75] 1 2 1 4 1 1 1 4 4 2 1 3 3 2 1 2 1 1 3 6 4 1 1 1 1 1 4 4 1 6 1 1 4 4 1 4 1
## [112] 4 2 1 1 1 1 1 2 1 1 1 1 1 2 6 2 2 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 4 1 1
## [149] 1 2 1 1 1 1 1 4 4 1 1 4 4 1 1 4 1 4 4 1 1 1 4 4 1 4 1 4 2 1 1 1 1 6 4 5 1
## [186] 1 1 1 4 4 2 1 1 4 1 2 2 4 1 1 4 4 2 1 1 4 1 1 1 4 1 3 1 1 4 4 4 1 4 1 1 4
## [223] 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 1 1 4 4 1 1 1 1 1 3 1 2 4 2 1 1 2
## [260] 2 1 1 2 1 4 4 4 2 4 1 1 1 1 2 1 1 2 2 1 1 1 1 2 2 2 2 1 2 1 2 1 1 1 4 2 1
## [297] 1 1 1 1 1 4 4 4 4 4 4 1 1 4 1 2 4 1 1 4 1 1 1 4 1 1 1 1 2 6 1 1 2 1 1 4 2
## [334] 3 2 2 1 1 1 1 4 4 1 4 1 1 4 2 1 4 1 4 1 4 2 1 2 4 1 1 1 1 1 1 1 1 1 1 2 1
## [371] 2 2 1 1 1 1 4 2 1 1 2 2 2 1 4 1 1 2 1 1 1 1 1 2 1 1 4 1 1 1 1 2 2 2 2 1 2
## [408] 4 1 1 1 1 1 2 1 1 4 1 4 1 4 1 2 1 1 2 4 2 1 1 1 2 1 1 1 2 2 4 1 1
##
## Within cluster sum of squares by cluster:
## [1] 187.7405 240.9513 149.4481 224.3460 0.0000 117.8370
## (between_SS / total_SS = 65.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
set.seed(4561)
km.out1 <- kmeans(df_t, 8, nstart = 20)
km.out1
## K-means clustering with 8 clusters of sizes 42, 2, 28, 41, 98, 5, 1, 223
##
## Cluster means:
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 0.2221728 -0.2896876 -0.42682693 1.43519472 -0.5011524 -0.04768952
## 2 0.7918828 0.5610464 -0.01128859 9.24203651 -0.4635194 0.93210312
## 3 -0.4453660 1.5449634 1.95334680 -0.25357812 2.1391487 0.38983626
## 4 2.2141700 -0.1367268 -0.16118733 0.32383076 -0.3593641 0.41697371
## 5 -0.4883343 0.4118988 0.50256612 -0.31004906 0.4748524 0.01870206
## 6 1.0755395 5.1033075 5.63190631 -0.08979632 5.6823687 0.41981740
## 7 1.9645810 5.1696185 1.28575327 6.89275382 -0.5542311 16.45971129
## 8 -0.2184362 -0.4379401 -0.48803805 -0.27353403 -0.4375782 -0.21643132
##
## Clustering vector:
## [1] 5 5 5 1 4 8 8 5 8 5 5 8 4 5 5 8 5 8 5 8 8 8 4 3 5 8 8 8 3 4 8 8 8 4 8 5 4
## [38] 5 5 4 1 8 5 3 5 3 5 6 5 3 8 8 4 5 4 8 3 5 8 5 8 6 5 5 8 3 8 5 8 8 1 4 1 1
## [75] 5 8 1 3 8 8 8 5 5 8 8 6 6 4 1 4 8 1 3 2 5 8 8 8 8 8 5 5 5 4 8 8 5 5 5 5 8
## [112] 5 1 8 8 8 8 8 1 8 8 8 8 5 4 4 1 5 8 4 8 8 8 8 8 8 5 8 8 8 8 4 4 1 8 3 8 8
## [149] 8 4 8 8 8 8 8 3 5 8 5 5 5 8 8 3 5 5 5 8 8 8 5 3 8 5 8 5 4 8 8 8 8 4 5 7 8
## [186] 8 8 5 5 5 1 8 8 5 8 1 1 5 8 8 3 3 4 8 8 3 8 5 8 3 8 3 8 5 5 5 3 8 5 8 8 5
## [223] 1 8 8 8 8 8 8 1 1 8 8 8 8 8 8 8 8 4 1 8 8 8 5 5 8 8 8 8 8 3 1 4 5 4 8 8 4
## [260] 4 8 1 8 8 5 5 5 8 5 8 8 8 8 4 8 8 4 1 8 5 8 8 4 1 4 4 8 1 8 4 8 8 8 5 8 8
## [297] 8 8 5 8 8 5 5 5 3 5 5 8 8 5 1 4 3 8 8 5 8 8 8 3 8 8 8 8 8 2 8 8 1 8 8 3 8
## [334] 6 1 4 8 1 1 1 5 5 5 3 8 8 5 4 8 3 8 3 8 5 1 8 8 5 5 8 8 8 8 8 8 5 8 8 8 8
## [371] 4 1 8 8 8 8 5 4 8 8 4 1 4 8 5 8 8 8 8 8 8 8 8 1 8 8 5 1 8 8 8 1 8 8 8 8 1
## [408] 5 8 8 8 8 5 1 8 5 5 5 5 8 5 8 8 8 8 1 5 1 8 8 5 1 8 8 8 1 4 3 8 8
##
## Within cluster sum of squares by cluster:
## [1] 53.88599 18.64568 113.32003 187.75738 124.78705 91.53407 0.00000
## [8] 145.22385
## (between_SS / total_SS = 72.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
plot(df_t$Fresh, df_t$Frozen, col=(km.out1$cluster+2), main="K-means clustering results with K=8", xlab="", ylab="", pch=19, cex=1)
# The centers are in scaled units, so the scaled data (df_t) is plotted rather than the raw df
plot(df_t[c("Fresh", "Milk")], col=km.out1$cluster)
points(km.out1$centers[,c("Fresh", "Milk")], pch=1, cex=3, col="red", lwd=2)
plot(df_t[c("Fresh", "Frozen")], col=km.out1$cluster)
points(km.out1$centers[,c("Fresh", "Frozen")], pch=1, cex=3, col="red", lwd=2)
Here we run the full scaled data and check 2 to 10 clusters for the best option. The within-cluster sum of squares plot shows a steady decline from k = 2 with no sharp elbow. The average silhouette width plot is interesting: it drops after 2 clusters but bounces back up around 4 to 5 before dropping again. It is unknown why it does this, but nk[which.max(SW)] confirms k = 2 as the silhouette optimum, so we are still on the 2-cluster option. Out of curiosity we also run k = 4, which produces a between_SS / total_SS ratio of 47.8%; since that ratio always rises with k, comparing it against the 26.2% at k = 2 does not by itself settle which is better. The centers show 3 clusters in the bottom left, which is expected, and one up in the top left, which is completely unexpected and confusing. Something doesn’t seem right here and it is unknown what that is.
nk <- 2:10
set.seed(22)
WSS <- sapply(nk, function(k) {kmeans(df_z, centers = k)$tot.withinss})
WSS
## [1] 2593.4767 2269.5053 1944.3840 1676.5956 1467.4379 1186.9108 1120.7044
## [8] 1031.2988 908.9784
plot(nk, WSS, type="l", xlab="number of clusters k", ylab="within sum of squares")
SW <- sapply(nk, function(k) {cluster.stats(dist(df_z), kmeans(df_z,centers=k)$cluster)$avg.silwidth})
SW
## [1] 0.3732334 0.3067572 0.3224285 0.3562636 0.3469665 0.2809717 0.3482999
## [8] 0.3277299 0.3257194
plot(nk, SW, type="l", xlab="number of clusters", ylab="average silhouette width")
nk[which.max(SW)]
## [1] 2
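The same elbow and silhouette searches can also be drawn with factoextra (already loaded); a sketch that was not part of the original run:
# Sketch: factoextra equivalents of the two plots above
# fviz_nbclust(df_z, kmeans, method = "wss")
# fviz_nbclust(df_z, kmeans, method = "silhouette")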
km.out <- kmeans(df_z, 4, nstart = 20)
km.out
## K-means clustering with 4 clusters of sizes 10, 296, 3, 131
##
## Cluster means:
## Channel Region Fresh Milk Grocery Frozen
## 1 1.4470045 -0.05577083 0.31347349 3.9174467 4.2707490 -0.003570131
## 2 -0.6822942 -0.04704424 0.07979916 -0.3554503 -0.4344436 0.073138067
## 3 -0.6895122 0.15948501 3.84044484 3.2957757 0.9852919 7.204892918
## 4 1.4470045 0.10690343 -0.29218794 0.4286373 0.6330682 -0.329983553
## Detergents_Paper Delicassen
## 1 4.6129149 0.50279301
## 2 -0.4417690 -0.10544775
## 3 -0.1527927 6.79967230
## 4 0.6495639 0.04416479
##
## Clustering vector:
## [1] 4 4 4 2 4 4 4 4 2 4 4 4 4 4 4 2 4 2 4 2 4 2 2 4 4 4 2 2 4 2 2 2 2 2 2 4 2
## [38] 4 4 2 2 2 4 4 4 4 4 1 4 4 2 2 4 4 2 2 1 4 2 2 4 1 4 4 2 1 2 4 2 2 2 2 2 4
## [75] 4 2 2 4 2 2 2 4 4 2 4 1 1 2 2 2 2 2 1 2 4 2 4 2 2 2 4 4 4 2 2 2 4 4 4 4 2
## [112] 4 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2
## [149] 2 2 2 2 2 2 2 4 4 2 4 4 4 2 2 4 4 4 4 2 2 2 4 4 2 4 2 4 2 2 2 2 2 3 2 3 2
## [186] 2 2 2 4 4 2 2 2 4 2 2 2 4 2 2 4 4 2 2 2 4 2 4 2 4 2 1 2 2 4 2 4 2 4 2 2 2
## [223] 2 4 2 2 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 4 2 4 2 4 2 2 2 2 2 2 2 2 2 2 4 2 4 2 2 2 2 2 2 2 2 2 2 2 4 2 4
## [297] 2 4 4 2 4 4 4 4 4 4 4 2 2 4 2 2 4 2 2 4 2 2 2 4 2 2 2 2 2 3 2 2 2 2 2 4 2
## [334] 1 2 4 2 2 2 2 4 4 2 4 2 2 4 4 2 4 2 4 2 4 2 2 2 4 2 2 2 2 2 2 2 4 2 2 2 2
## [371] 4 2 2 4 2 2 4 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2
## [408] 4 4 2 2 2 2 2 2 4 4 2 4 2 2 4 2 4 4 2 2 2 2 2 2 2 2 2 2 2 2 4 2 2
##
## Within cluster sum of squares by cluster:
## [1] 160.2905 1020.2318 215.6517 437.3645
## (between_SS / total_SS = 47.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# As before, the x-axis is just the row index, so the center overlay is only a rough guide
plot(df_z$Fresh, col=(km.out$cluster+2), main="K-means clustering results with K=4", xlab="", ylab="", pch=19, cex=0.7)
points(km.out$centers[,c("Fresh", "Frozen")], pch=1, cex=3, col="red", lwd=2)
Hierarchical Cluster Analysis
This exercise produces a messy dendrogram, but the general shape of how each linkage breaks down the data is somewhat legible from the tree shapes alone. Complete linkage seems the best; single linkage is very hard to read.
set.seed(89)
km <- kmeans(df_z, 4, nstart = 20)
hc.complete <- hclust(dist(df_z), method= "complete")
hc.average <- hclust(dist(df_z), method= "average")
hc.single <- hclust(dist(df_z), method= "single")
# par(mfrow=c(1,3)) gave a bad side-by-side display, so each tree is plotted on its own
plot(hc.complete, main="Complete Linkage", xlab ="", sub ="", cex =0.05)
plot(hc.average, main="Average Linkage", xlab ="", sub ="", cex =0.05)
plot(hc.single, main="Single Linkage", xlab ="", sub ="", cex =0.01)
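One hedged tweak (not used in the original plots) that makes the trees easier to compare is suppressing the 440 leaf labels:
# Sketch: drop the row labels so only the tree shape is drawn
# plot(hc.complete, main = "Complete Linkage", labels = FALSE, xlab = "", sub = "")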
# The four membership vectors below appear to be cutree() output (likely cutree(hc.complete, 2),
# cutree(hc.average, 2), cutree(hc.single, 2), and cutree(hc.single, 4)); the code lines
# that produced them were missing from the original write-up.
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 3 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 1 1 1 1
## [334] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [408] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
xsc <- scale(df_z) # df_z is already scaled, so this re-scaling is effectively a no-op
plot(hclust(dist(xsc), method= "complete"), main= "Hierarchical Clustering with Scaled Features", cex =0.01)
This graph uses complete linkage and produces a legible dendrogram. The k-means solution (k = 4) has a within-cluster SS of 1833.5 and an average silhouette width of 0.37; comparatively, the single- and complete-linkage cuts give 2918 and 3159 within-cluster SS with average silhouette widths of 0.67 and 0.83. Is the higher silhouette good? Not necessarily: the membership vectors above show the hierarchical cuts place nearly every observation in one giant cluster, which inflates the average silhouette width, so more descriptive direction is needed on this output before calling it good.
# Side demonstration on simulated data (unrelated to the wholesale data): correlation-based distance
x <- matrix(rnorm (30*3), ncol =3)
dd <- as.dist(1-cor(t(x)))
plot(hclust(dd, method= "complete"), main= "Complete Linkage with Correlation - Based Distance", xlab="", sub="")
hc_complete <- cutree(hc.complete, 2)
hc_single <- cutree(hc.single, 4)
# cluster.stats() comes from the fpc package loaded above
cs <- cluster.stats(dist(df_z), km$cluster)
cs[c("within.cluster.ss", "avg.silwidth")]
## $within.cluster.ss
## [1] 1833.538
##
## $avg.silwidth
## [1] 0.3674071
sapply(list(kmeans=km$cluster, hc_single=hc_single, hc_complete=hc_complete), function(c) cluster.stats(dist(df_z), c)[c("within.cluster.ss", "avg.silwidth")])
## kmeans hc_single hc_complete
## within.cluster.ss 1833.538 2918.012 3159.398
## avg.silwidth 0.3674071 0.6678067 0.8263512
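A quick hedged check (not run in the original) that the high hierarchical silhouette widths come from very unbalanced cuts:
# Sketch: cluster sizes of the hierarchical cuts; nearly all observations fall in one cluster
# table(hc_complete)
# table(hc_single)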
Optional Alternative Method
This method is an alternative to the one above; maybe better graphing and a clearer understanding of what is happening can be gained from it. The first part determines the optimal number of clusters, and we again drop the Channel and Region variables for this section. The method is straightforward and spells things out: 9 indices proposed 2 as the best number of clusters, and the bar chart clearly shows 2 clusters as optimal with 3 the next best.
df_hc <- df_z[3:8]
str(df_hc)
## 'data.frame': 440 obs. of 6 variables:
## $ Fresh : num 0.0529 -0.3909 -0.4465 0.1 0.8393 ...
## $ Milk : num 0.523 0.5438 0.4081 -0.6233 -0.0523 ...
## $ Grocery : num -0.0411 0.1701 -0.0281 -0.3925 -0.0793 ...
## $ Frozen : num -0.589 -0.27 -0.137 0.686 0.174 ...
## $ Detergents_Paper: num -0.0435 0.0863 0.1331 -0.498 -0.2317 ...
## $ Delicassen : num -0.0663 0.089 2.2407 0.0933 1.2979 ...
nb <- NbClust(df_hc, distance="euclidean", min.nc = 2, max.nc = 10, method = "complete", index = "all")
## Warning in pf(beale, pp, df2): NaNs produced
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
fviz_nbclust(nb) + theme_minimal()
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 9 proposed 2 as the best number of clusters
## * 7 proposed 3 as the best number of clusters
## * 1 proposed 4 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
km.res <- eclust(df_hc, "kmeans", k = 2, nstart=25, graph = FALSE)
km.res$cluster
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 2
## [75] 2 2 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
## [149] 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2
## [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2
## [334] 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [371] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [408] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
The cluster plot is now much more legible in this graphic: k = 2, with the center of cluster 1 marked by the red dot and the center of cluster 2 by the blue triangle.
fviz_cluster(km.res, geom="point", ellipse.type="norm")
pam.res <- eclust(df_hc, "pam", k=2, graph=FALSE)
pam.res$cluster
## [1] 1 2 2 1 1 1 1 1 1 2 2 1 2 2 2 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1 1 1 1
## [38] 2 2 1 1 1 2 2 1 2 2 2 2 2 1 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2 1 1 1 2 1 1
## [75] 1 1 1 2 1 1 1 2 2 1 1 2 2 1 1 1 1 1 2 1 2 1 1 1 1 1 2 2 1 1 1 1 2 2 1 2 1
## [112] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [149] 1 1 1 1 1 1 1 2 2 1 1 2 2 1 1 2 1 2 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 2 2 2 1
## [186] 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 2 2 1 1 1 2 1 1 1 2 1 2 1 1 2 2 2 1 2 1 1 2
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1
## [260] 1 1 1 1 1 2 2 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## [297] 1 1 1 1 1 2 2 2 2 2 2 1 1 2 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## [334] 2 1 1 1 1 1 1 2 2 1 2 1 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1
## [371] 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [408] 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1
fviz_cluster(pam.res, geom="point", ellipse.type="norm")
res.hc <- eclust(df_hc, "hclust", k=2, method="complete", graph=FALSE)
head(res.hc$cluster, 10)
## [1] 1 1 1 1 1 1 1 1 1 1
This dendrogram is better than before with some color added; however, cluster 1 is stuffed into the left corner.
fviz_dend(res.hc, rect=TRUE, show_labels=FALSE)
sil <- silhouette(km.res$cluster, dist(df_hc))
head(sil[,1:3], 10)
## cluster neighbor sil_width
## [1,] 2 1 0.6614836
## [2,] 2 1 0.6378712
## [3,] 2 1 0.4547508
## [4,] 2 1 0.7168852
## [5,] 2 1 0.5961813
## [6,] 2 1 0.7059518
## [7,] 2 1 0.7204588
## [8,] 2 1 0.6808825
## [9,] 2 1 0.7349506
## [10,] 2 1 0.3320876
fviz_silhouette(sil)
## cluster size ave.sil.width
## 1 1 41 0.03
## 2 2 399 0.65
si.sum <- summary(sil)
si.sum$clus.avg.widths
## 1 2
## 0.02503374 0.64967036
si.sum$avg.width
## [1] 0.5914656
si.sum$clus.sizes
## cl
## 1 2
## 41 399
fviz_silhouette(km.res)
## cluster size ave.sil.width
## 1 1 41 0.03
## 2 2 399 0.65
silinfo <- km.res$silinfo
names(silinfo)
## [1] "widths" "clus.avg.widths" "avg.width"
head(silinfo$widths)
## cluster neighbor sil_width
## 212 1 2 0.3089443
## 93 1 2 0.2887751
## 66 1 2 0.2857945
## 62 1 2 0.2785948
## 252 1 2 0.2713289
## 57 1 2 0.2534072
silinfo$clus.avg.widths
## [1] 0.02503374 0.64967036
silinfo$avg.width
## [1] 0.5914656
km.res$size
## [1] 41 399
dd <- dist(df_hc, method = "euclidean")
km_stats <- cluster.stats(dd, km.res$cluster)
km_stats$within.cluster.ss
## [1] 1949.348
km_stats$clus.avg.silwidths
## 1 2
## 0.02503374 0.64967036
The Dunn statistic is 0.0189 and Dunn2 is 1.07; the very low Dunn value reflects the small minimum separation between the clusters relative to the largest cluster diameter (both components appear in the printout below).
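As a hedged sanity check (not in the original), the Dunn index can be recomputed from those components:
# Sketch: Dunn index = minimum between-cluster separation / maximum cluster diameter
# km_stats$min.separation / km_stats$max.diameter # 0.4019017 / 21.24427 = 0.0189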
km_stats
## $n
## [1] 440
##
## $cluster.number
## [1] 2
##
## $cluster.size
## [1] 41 399
##
## $min.cluster.size
## [1] 41
##
## $noisen
## [1] 0
##
## $diameter
## [1] 21.244265 7.938436
##
## $average.distance
## [1] 5.223323 1.933809
##
## $median.distance
## [1] 3.149260 1.726398
##
## $separation
## [1] 0.4019017 0.4019017
##
## $average.toother
## [1] 5.595136 5.595136
##
## $separation.matrix
## [,1] [,2]
## [1,] 0.0000000 0.4019017
## [2,] 0.4019017 0.0000000
##
## $ave.between.matrix
## [,1] [,2]
## [1,] 0.000000 5.595136
## [2,] 5.595136 0.000000
##
## $average.between
## [1] 5.595136
##
## $average.within
## [1] 2.240332
##
## $n.between
## [1] 16359
##
## $n.within
## [1] 80221
##
## $max.diameter
## [1] 21.24427
##
## $min.separation
## [1] 0.4019017
##
## $within.cluster.ss
## [1] 1949.348
##
## $clus.avg.silwidths
## 1 2
## 0.02503374 0.64967036
##
## $avg.silwidth
## [1] 0.5914656
##
## $g2
## NULL
##
## $g3
## NULL
##
## $pearsongamma
## [1] 0.5891823
##
## $dunn
## [1] 0.01891813
##
## $dunn2
## [1] 1.071183
##
## $entropy
## [1] 0.3098382
##
## $wb.ratio
## [1] 0.400407
##
## $ch
## [1] 153.8348
##
## $cwidegap
## [1] 12.977504 3.612317
##
## $widestgap
## [1] 12.9775
##
## $sindex
## [1] 0.9162984
##
## $corrected.rand
## NULL
##
## $vi
## NULL