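Hierarchical clustering assumes that the classes have a hierarchical structure and groups the samples into hierarchically organized clusters. There are two approaches: bottom-up (agglomerative) and top-down (divisive). Because each sample can belong to only one cluster, hierarchical clustering is a form of hard clustering.
Basic principle (this article covers only the bottom-up, agglomerative method):
Two functions in scipy are enough to do this.
scipy.cluster.hierarchy.linkage(data, method='average', metric="euclidean") performs the hierarchical clustering itself, i.e. the three steps above.
The metric parameter: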
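As a minimal sketch, assuming the usual three-step formulation of the bottom-up method (start with every sample as its own cluster, find the two closest clusters, merge them and repeat until one cluster remains), the procedure could look like the code below. The function name `naive_average_linkage` and the random toy data are hypothetical and chosen only for illustration; the cluster distance used here is the average pairwise Euclidean distance.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import cdist


def naive_average_linkage(X):
    """Return the merge distances of a naive bottom-up average-linkage run."""
    # Step 1: every sample starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(X))]
    merge_distances = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the smallest average distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cdist(X[clusters[a]], X[clusters[b]]).mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge that pair, record the distance, and repeat.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        merge_distances.append(d)
    return merge_distances


rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # hypothetical toy data

naive = naive_average_linkage(X)
Z = hierarchy.linkage(X, method="average", metric="euclidean")
print(np.allclose(naive, Z[:, 2]))  # compare with scipy's merge distances
```

The naive loop is far slower than scipy's implementation and is meant only to make the three steps concrete.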
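To make the return value of `linkage` concrete, here is a small sketch on five one-dimensional points (hypothetical toy data, not from the article). It prints the linkage matrix row by row; each row describes one merge.

```python
import numpy as np
from scipy.cluster import hierarchy

# Five 1-D points (hypothetical toy data): two tight pairs and one far point.
points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

Z = hierarchy.linkage(points, method="average", metric="euclidean")

# Z has n - 1 rows and 4 columns. Each row is one merge:
# (index of cluster a, index of cluster b, merge distance, size of new cluster).
# Indices 0..n-1 are the original samples; n and above are merged clusters.
print(Z.shape)
for a, b, dist, size in Z:
    print(int(a), int(b), round(float(dist), 3), int(size))
```

The last row always describes the final merge, so its size column equals the number of samples.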
metric : str or function, optional
The distance metric to use. The distance function can
be 'braycurtis', 'canberra', 'chebyshev', 'cityblock',
'correlation', 'cosine', 'dice', 'euclidean', 'hamming',
'jaccard', 'jensenshannon', 'kulsinski', 'mahalanobis', 'matching',
'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean',
'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'.
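The `metric` argument is applied when `linkage` turns the observations into pairwise distances. A short sketch, assuming random toy data, shows that passing `metric='cityblock'` directly gives the same result as computing the condensed distance vector with `scipy.spatial.distance.pdist` first and handing that to `linkage`.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # hypothetical toy data: 8 samples, 3 features

# Option 1: let linkage compute the distances with the chosen metric.
Z_direct = hierarchy.linkage(X, method="average", metric="cityblock")

# Option 2: compute the condensed distance vector first, then cluster it.
condensed = pdist(X, metric="cityblock")
Z_from_dist = hierarchy.linkage(condensed, method="average")

same = np.allclose(Z_direct, Z_from_dist)
print(condensed.shape, same)
```

The second form is handy when the distances come from somewhere else, for example a precomputed dissimilarity table.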
The method parameter:
* method='single' assigns $d(u,v) = \min(dist(u[i], v[j]))$ for all points $i$ in cluster $u$ and $j$ in cluster $v$. This is also known as the Nearest Point Algorithm.
* method='complete' assigns $d(u,v) = \max(dist(u[i], v[j]))$ for all points $i$ in cluster $u$ and $j$ in cluster $v$. This is also known as the Farthest Point Algorithm or Voor Hees Algorithm.
* method='average' assigns $d(u,v) = \sum_{ij} \frac{d(u[i], v[j])}{(|u| * |v|)}$ for all points $i$ and $j$, where $|u|$ and $|v|$ are the cardinalities of clusters $u$ and $v$, respectively. This is also called the UPGMA algorithm.
* method='weighted' assigns $d(u,v) = (dist(s,v) + dist(t,v))/2$, where cluster $u$ was formed with clusters $s$ and $t$, and $v$ is a remaining cluster in the forest (also called WPGMA).
* method='centroid' assigns $dist(s,t) = ||c_s - c_t||_2$, where $c_s$ and $c_t$ are the centroids of clusters $s$ and $t$, respectively. When two clusters $s$ and $t$ are combined into a new cluster $u$, the new centroid is computed over all the original objects in clusters $s$ and $t$. The distance then becomes the Euclidean distance between the centroid of $u$ and the centroid of a remaining cluster $v$ in the forest. This is also known as the UPGMC algorithm.
* method='median' assigns $d(s,t)$ like the centroid method. When two clusters $s$ and $t$ are combined into a new cluster $u$, the average of the centroids of $s$ and $t$ gives the new centroid of $u$. This is also known as the WPGMC algorithm.
* method='ward' uses the Ward variance minimization algorithm. The new entry $d(u,v)$ is computed as $d(u,v) = \sqrt{\frac{|v|+|s|}{T} d(v,s)^2 + \frac{|v|+|t|}{T} d(v,t)^2 - \frac{|v|}{T} d(s,t)^2}$, where $u$ is the newly joined cluster consisting of clusters $s$ and $t$, $v$ is an unused cluster in the forest, $T = |v|+|s|+|t|$, and $|*|$ is the cardinality of its argument. This is also known as the incremental algorithm.
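The 'single', 'complete' and 'average' definitions can be checked by hand. The sketch below assumes two well-separated groups of three points each (hypothetical toy data), so the last merge reported by `linkage` is the merge of the two groups; its distance should then equal the minimum, maximum and mean of the cross-group distances, respectively.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import cdist

# Two well-separated groups (hypothetical toy data).
u = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
v = np.array([[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]])
X = np.vstack([u, v])

# All distances between a point in u and a point in v.
cross = cdist(u, v)
by_hand = {
    "single": cross.min(),     # nearest pair
    "complete": cross.max(),   # farthest pair
    "average": cross.mean(),   # sum over all pairs divided by |u| * |v|
}

# Distance of the final merge (column 2 of the last row) for each method.
last_merge = {
    m: hierarchy.linkage(X, method=m, metric="euclidean")[-1, 2]
    for m in by_hand
}
for m in by_hand:
    print(m, round(float(by_hand[m]), 4), round(float(last_merge[m]), 4))
```

The scipy documentation notes that 'centroid', 'median' and 'ward' are correctly defined only with the Euclidean metric, which is why they are left out of this check.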
scipy.cluster.hierarchy.dendrogram(Z, labels=label, above_threshold_color='C0') is mainly used to draw the dendrogram of the hierarchical clustering.
The complete code is as follows:
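`dendrogram` also returns a dictionary describing what it draws. As a small sketch on hypothetical toy data, passing `no_plot=True` skips the drawing and only returns that dictionary, which is a convenient way to inspect the leaf order.

```python
import numpy as np
from scipy.cluster import hierarchy

# Hypothetical toy data and leaf names.
points = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
names = ["a", "b", "c", "d", "e"]

Z = hierarchy.linkage(points, method="average", metric="euclidean")

# no_plot=True computes the layout without drawing anything.
info = hierarchy.dendrogram(Z, labels=names, no_plot=True)

print(info["ivl"])          # leaf labels in left-to-right order
print(info["leaves"])       # matching row indices into `points`
print(len(info["dcoord"]))  # one U-shaped link per merge
```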
from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
from scipy.cluster import hierarchy  # hierarchical clustering

# Only needed when the figure contains Chinese text
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

iris = load_iris()
data = iris.data
label = iris.target

fig, ax = plt.subplots(1, 1, figsize=(50, 8))  # figsize is the canvas size
# How clusters are merged: average linkage on Euclidean distances
Z = hierarchy.linkage(data, method='average', metric="euclidean")
hierarchy.dendrogram(Z, labels=label, above_threshold_color='C0', ax=ax)  # draw the dendrogram
plt.xticks(fontsize=14, rotation=0)  # font size and orientation of the x-axis labels
plt.tight_layout()  # adjust subplot parameters so the plot fills the figure
plt.savefig("H_iris.png", dpi=100, bbox_inches='tight')  # save the figure
plt.show()
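The data used is the Iris dataset. In the dendrogram, class 0 is clearly separated from classes 1 and 2, and classes 1 and 2 can also be told apart fairly clearly.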
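A visual impression like this can also be checked numerically by cutting the tree into flat clusters with `scipy.cluster.hierarchy.fcluster` and counting how the clusters line up with the known labels. The sketch below uses three synthetic, well-separated Gaussian blobs rather than Iris (an assumption made so the example has no extra dependencies); with Iris, `data` and `label` from the full script would take the place of `X` and `true_label`.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three synthetic blobs of 30 points each (hypothetical stand-in for Iris).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_label = np.repeat(np.arange(3), 30)
X = centers[true_label] + rng.normal(scale=0.5, size=(90, 2))

Z = hierarchy.linkage(X, method="average", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
pred = hierarchy.fcluster(Z, t=3, criterion="maxclust")

# Contingency table: rows are true classes, columns are found clusters.
table = np.zeros((3, 3), dtype=int)
for t, p in zip(true_label, pred):
    table[t, p - 1] += 1
print(table)
```

A row with a single large entry means that class was recovered as one cluster; entries spread over several columns would indicate classes that the chosen linkage does not separate cleanly.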