R语言聚类分析(1)

Original 阿越就是我医学和生信笔记 2023-06-15

收录于合集

#医学统计学 63 个

#r语言 205 个

#聚类/主成分分析 12 个

关注公众号，发送R语言或python，可获取资料

“
医学和生信笔记，专注R语言在临床医学中的使用、R语言数据分析和可视化。主要分享R语言做医学统计学、临床研究设计、meta分析、网络药理学、临床预测模型、机器学习、生物信息学等。

💡专注R语言在🩺生物医学中的使用

主要介绍使用R语言进行层次聚类、划分聚类（K均值聚类和PAM）。

关于更多聚类分析可视化方法，可以参考下一期推文～

系统聚类（层次聚类,Hierarchical clustering）

使用nutrient数据集进行演示，这个数据集包含不同食物中的营养物质含量。

# 没安装flexclust包的需要先安装
data(nutrient, package = "flexclust")
row.names(nutrient) <- tolower(row.names(nutrient))

dim(nutrient) # 27行5列
## [1] 27  5

psych::headTail(nutrient)
##                 energy protein fat calcium iron
## beef braised       340      20  28       9  2.6
## hamburger          245      21  17       9  2.7
## beef roast         420      15  39       7    2
## beef steak         375      19  32       9  2.6
## ...                ...     ... ...     ...  ...
## salmon canned      120      17   5     159  0.7
## sardines canned    180      22   9     367  2.5
## tuna canned        170      25   7       7  1.2
## shrimp canned      110      23   1      98  2.6

层次聚类在R语言中非常简单，通过hclust实现。

# 聚类前先进行标准化
nutrient.scaled <- scale(nutrient)
h.clust <- hclust(dist(nutrient.scaled,method = "euclidean"), # 计算距离有不同方法
                  method = "average" # 层次聚类有不同方法
                  )

下面就是画图，简单点可以直接用plot()。

plot(h.clust,hang = -1,main = "层次聚类", sub="", 
     xlab="", cex.lab = 1.0, cex.axis = 1.0, cex.main = 2)

plot of chunk unnamed-chunk-3

关于更加精细化的细节修改，下次再介绍。或者可以借助其他R包快速绘制好看的聚类分析图形。

如何选择聚类的个数呢？

可以通过R包NbClust实现。

library(NbClust)

nc <- NbClust(nutrient.scaled, distance = "euclidean",
              min.nc = 2, # 最小聚类数
              max.nc = 10, # 最大聚类树
              method = "average"
              )

plot of chunk unnamed-chunk-4

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

plot of chunk unnamed-chunk-4

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 4 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 5 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

输出日志里给出了评判准则以及最终结果：Hubert index和D index使用图形的方式判断最佳聚类个数，拐点明显的可视作最佳聚类个数。

它给出的结论是最佳聚类数是2。我们也可以通过条形图查看这些评判准则的具体数量。

barplot(table(nc$Best.nc[1,]),
        xlab = "聚类数目",
        ylab = "评判准则个数"
        )

plot of chunk unnamed-chunk-5

从条形图中可以看出，聚类数目为2,3,5,10时，评判准则个数最多，为5个，这里我们可以选择5个。

# 把聚类树划分为5类
cluster <- cutree(h.clust, k=5)

# 查看每一类有多少例
table(cluster)
## cluster
##  1  2  3  4  5 
##  7 16  1  2  1

把最终结果画出来：

plot(h.clust, hang = -1,main = "",xlab = "")
rect.hclust(h.clust, k=5) # 添加矩形，方便观察

plot of chunk unnamed-chunk-7

快速聚类（划分聚类,partitioning clustering）

K-means聚类

K-means聚类，K均值聚类，是快速聚类的一种。比层次聚类更适合大样本的数据。在R语言中可以通过kmeans()实现K均值聚类。

使用K均值聚类处理178种葡萄酒中13种化学成分的数据集。

data(wine, package = "rattle")
df <- scale(wine[,-1])

psych::headTail(df)
##     Alcohol Malic   Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
## 1      1.51 -0.56  0.23      -1.17      1.91    0.81       1.03         -0.66
## 2      0.25  -0.5 -0.83      -2.48      0.02    0.57       0.73         -0.82
## 3       0.2  0.02  1.11      -0.27      0.09    0.81       1.21          -0.5
## 4      1.69 -0.35  0.49      -0.81      0.93    2.48       1.46         -0.98
## ...     ...   ...   ...        ...       ...     ...        ...           ...
## 175    0.49  1.41  0.41       1.05      0.16   -0.79      -1.28          0.55
## 176    0.33  1.74 -0.39       0.15      1.42   -1.13      -1.34          0.55
## 177    0.21  0.23  0.01       0.15      1.42   -1.03      -1.35          1.35
## 178    1.39  1.58  1.36        1.5     -0.26   -0.39      -1.27          1.59
##     Proanthocyanins Color   Hue Dilution Proline
## 1              1.22  0.25  0.36     1.84    1.01
## 2             -0.54 -0.29   0.4     1.11    0.96
## 3              2.13  0.27  0.32     0.79    1.39
## 4              1.03  1.18 -0.43     1.18    2.33
## ...             ...   ...   ...      ...     ...
## 175           -0.32  0.97 -1.13    -1.48    0.01
## 176           -0.42  2.22 -1.61    -1.48    0.28
## 177           -0.23  1.83 -1.56     -1.4     0.3
## 178           -0.42  1.79 -1.52    -1.42   -0.59

进行K均值聚类时，需要在一开始就指定聚类的个数，我们也可以通过NbClust包实现这个过程。

library(NbClust)

set.seed(123)
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")# 方法选择kmeans

plot of chunk unnamed-chunk-13

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

plot of chunk unnamed-chunk-13

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 19 proposed 3 as the best number of clusters 
## * 1 proposed 14 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

结果中给出了划分依据以及最佳的聚类数目为3个，可以画图查看结果：

table(nc$Best.nc[1,])
## 
##  0  1  2  3 14 15 
##  2  1  2 19  1  1

barplot(table(nc$Best.nc[1,]),
        xlab = "聚类数目",
        ylab = "评判准则个数"
        )

plot of chunk unnamed-chunk-14

可以看到聚类数目为3是最佳的选择。

确定最佳聚类个数过程也可以通过非常好用的R包factoextra实现。

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

set.seed(123)
fviz_nbclust(df, kmeans, k.max = 15)

plot of chunk unnamed-chunk-15

这个结果给出的最佳聚类个数也是3个。

下面进行K均值聚类，聚类数目设为3.

set.seed(123)
fit.km <- kmeans(df, centers = 3, nstart = 25)
fit.km
## K-means clustering with 3 clusters of sizes 51, 62, 65
## 
## Cluster means:
##      Alcohol      Malic        Ash Alcalinity   Magnesium     Phenols
## 1  0.1644436  0.8690954  0.1863726  0.5228924 -0.07526047 -0.97657548
## 2  0.8328826 -0.3029551  0.3636801 -0.6084749  0.57596208  0.88274724
## 3 -0.9234669 -0.3929331 -0.4931257  0.1701220 -0.49032869 -0.07576891
##    Flavanoids Nonflavanoids Proanthocyanins      Color        Hue   Dilution
## 1 -1.21182921    0.72402116     -0.77751312  0.9388902 -1.1615122 -1.2887761
## 2  0.97506900   -0.56050853      0.57865427  0.1705823  0.4726504  0.7770551
## 3  0.02075402   -0.03343924      0.05810161 -0.8993770  0.4605046  0.2700025
##      Proline
## 1 -0.4059428
## 2  1.1220202
## 3 -0.7517257
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2
##  [75] 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 1 3 3 2 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 
## Within cluster sum of squares by cluster:
## [1] 326.3537 385.6983 558.6971
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

结果很详细，K均值聚类聚为3类，每一类数量分别是51,62,65。然后还给出了聚类中心，每一个观测分别属于哪一个类。

不管是哪一种聚类方法，factoextra配合factomineR都可以给出非常好看的可视化结果。

fviz_cluster(fit.km, data = df)

plot of chunk unnamed-chunk-17

有非常多的细节可以调整，大家在使用的时候可以自己尝试，和之前推文中介绍的PCA美化一样，也是支持ggplot2语法的。

fviz_cluster(fit.km, data = df, 
             ellipse = T, # 增加椭圆
             ellipse.type = "t", # 椭圆类型
             geom = "point", # 只显示点不要文字
             palette = "lancet", # 支持超多配色方案
             ggtheme = theme_bw() # 支持更换主题
             )

plot of chunk unnamed-chunk-18

围绕中心点的划分PAM

K均值聚类是基于均值的，所以对异常值很敏感。一个更稳健的方法是围绕中心点的划分（PAM）。用一个最有代表性的观测值代表这一类(有点类似于主成分)。K均值聚类一般使用欧几里得距离，而PAM可以使用任意的距离来计算。因此，PAM可以容纳混合数据类型，并且不仅限于连续变量。

我们还是用葡萄酒数据进行演示。PAM聚类可以通过cluster包中的pam()实现。

library(cluster)

set.seed(123)
fit.pam <- pam(wine[-1,], k=3 # 聚为3类
               , stand = T # 聚类前进行标准化
               )
fit.pam
## Medoids:
##      ID Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
## 36   35    1   13.48  1.81 2.41       20.5       100    2.70       2.98
## 107 106    2   12.25  1.73 2.12       19.0        80    1.65       2.03
## 149 148    3   13.32  3.24 2.38       21.5        92    1.93       0.76
##     Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
## 36           0.26            1.86  5.10 1.04     3.47     920
## 107          0.37            1.63  3.40 1.00     3.17     510
## 149          0.45            1.25  8.42 0.55     1.62     650
## Clustering vector:
##   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   2 
##  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81 
##   2   2   1   2   1   2   2   2   1   2   1   2   1   1   2   2   2   2   1   2 
##  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 
##   2   2   3   2   2   2   2   2   2   2   2   2   2   2   1   1   2   1   2   2 
## 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 
##   2   2   2   2   2   2   2   2   1   2   2   2   2   2   2   2   2   2   2   2 
## 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 
##   1   2   2   2   2   2   2   2   2   3   3   3   3   3   3   3   3   3   3   3 
## 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 
##   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3 
## Objective function:
##    build     swap 
## 3.537365 3.504175 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"

Medoids给出了中心点，用第35个观测代表第1类，第106个观测代表第2类，第148个观测代表第3类。Clustering vector给出了每一个观测分别属于哪一个类。结果可以画出来：

clusplot(fit.pam, main = "PAM cluster")

plot of chunk unnamed-chunk-20

同样也可以用factoextra包实现可视化。

fviz_cluster(fit.pam, 
             ellipse = T, # 增加椭圆
             ellipse.type = "t", # 椭圆类型
             geom = "point", # 只显示点不要文字
             palette = "aaas", # 支持超多配色方案
             ggtheme = theme_bw() # 支持更换主题
             )

plot of chunk unnamed-chunk-21

以后会给大家带来factoextra和factomineR包的详细介绍。

咨询交流等，欢迎加入🐧QQ交流群：613637742

往期推荐

使用mlr3搞定二分类资料的多个模型评价和比较

R语言和医学统计学系列（1）：t检验

R语言和医学统计学系列（2）：方差分析

R语言和医学统计学系列（3）：卡方检验

R语言和医学统计学系列（4）：秩和检验

R语言和医学统计学系列（5）：多因素方差分析

R语言和医学统计学系列（6）：重复测量方差分析

R语言和医学统计学系列（7）：多元线性回归

R语言和医学统计学系列（8）：logistic回归

R语言和医学统计学系列（9）：多重检验

R语言和医学统计学系列（10）：正态性和方差齐性检验

R语言和医学统计学系列（11）：球形检验

R语言和医学统计学系列（12）：双变量回归与相关

R语言tidy风格医学统计学

R语言tidy风格医学统计学02

R语言和医学统计学（合辑）

R语言和医学统计学：非参数检验的补充

R语言和医学统计学系列：协方差分析

R语言和医学统计学系列：样本量计

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

圈内疯传某谣言

不要放过这些人渣

“辣条二哥”直追卫龙，麻辣王子工厂遇洪受关注

中金跳楼女生诡异之处。

R语言聚类分析(1)

系统聚类（层次聚类,Hierarchical clustering）

快速聚类（划分聚类,partitioning clustering）

K-means聚类

围绕中心点的划分PAM

您可能也对以下帖子感兴趣

李尚福、魏凤和双双被拿下，与美国一份报告是否有关？

圈内疯传某谣言

不要放过这些人渣

“辣条二哥”直追卫龙，麻辣王子工厂遇洪受关注

中金跳楼女生诡异之处。

生成图片，分享到微信朋友圈

R语言聚类分析(1)

系统聚类（层次聚类,Hierarchical clustering）

快速聚类（划分聚类,partitioning clustering）

K-means聚类

围绕中心点的划分PAM

您可能也对以下帖子感兴趣