赞
踩
AGNES(AGglomerative NESting 的简写)是一种采用自底向上聚合策略的层次聚类算法。
【工作过程】:
【关键】:如何计算聚类簇之间的距离。
实际上,每个簇是一个样本集合,因此,只需采用关于集合的某种距离即可。
最小距离:
d
m
i
n
(
C
i
,
C
j
)
=
m
i
n
x
∈
C
i
,
z
∈
C
j
d
i
s
t
(
x
,
z
)
,
最大距离:
d
m
a
x
(
C
i
,
C
j
)
=
m
a
x
x
∈
C
i
,
z
∈
C
j
d
i
s
t
(
x
,
z
)
,
平均距离:
d
a
v
g
(
C
i
,
C
j
)
=
1
∣
C
i
∣
∣
C
j
∣
∑
x
∈
C
i
∑
z
∈
C
j
d
i
s
t
(
x
,
z
)
.
\text{最小距离:}d_{min}(C_i, C_j) = min_{x\in C_i,z\in C_j}dist(x,z), \\ \text{最大距离:}d_{max}(C_i, C_j) = max_{x\in C_i,z\in C_j}dist(x,z), \\ \text{平均距离:}d_{avg}(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{x\in C_i}\sum_{z\in C_j}dist(x,z).
最小距离:dmin(Ci,Cj)=minx∈Ci,z∈Cjdist(x,z),最大距离:dmax(Ci,Cj)=maxx∈Ci,z∈Cjdist(x,z),平均距离:davg(Ci,Cj)=∣Ci∣∣Cj∣1x∈Ci∑z∈Cj∑dist(x,z).
显然,最小距离由两个簇的最近样本决定,最大距离由两个簇的最远样本决定,而平均距离则由两个簇的所有样本共同决定。
当聚类簇距离为
此外,豪斯多夫距离(Hausdorff distance)也可用于集合间的距离计算。关于豪斯多夫距离的介绍可参考这篇博文豪斯多夫距离
【算法描述】:
id | density | sugar content | id | density | sugar content | id | density | sugar content |
---|---|---|---|---|---|---|---|---|
1 | 0.697 | 0.460 | 11 | 0.245 | 0.057 | 21 | 0.748 | 0.232 |
2 | 0.774 | 0.376 | 12 | 0.343 | 0.099 | 22 | 0.714 | 0.346 |
3 | 0.634 | 0.264 | 13 | 0.639 | 0.161 | 23 | 0.483 | 0.312 |
4 | 0.608 | 0.318 | 14 | 0.657 | 0.198 | 24 | 0.478 | 0.437 |
5 | 0.556 | 0.215 | 15 | 0.360 | 0.370 | 25 | 0.525 | 0.369 |
6 | 0.403 | 0.237 | 16 | 0.593 | 0.042 | 26 | 0.751 | 0.489 |
7 | 0.481 | 0.149 | 17 | 0.719 | 0.103 | 27 | 0.532 | 0.472 |
8 | 0.437 | 0.211 | 18 | 0.359 | 0.188 | 28 | 0.473 | 0.376 |
9 | 0.666 | 0.091 | 19 | 0.339 | 0.241 | 29 | 0.725 | 0.445 |
10 | 0.243 | 0.267 | 20 | 0.282 | 0.257 | 30 | 0.446 | 0.459 |
以上述数据集(《机器学习》西瓜集 4.0)为例,令 AGNES 算法一直执行到所有样本出现在同一个簇中,即 k = 1,即可得到下图所示的“树状图”(dendrogram),其中每层链接一组聚类簇。
观测树状图的合并结果,可以发现该合并结果是 AGNES 算法以“最大距离”作为距离度量标准进行合并。此外,在树状图的特定层次上进行分割,则可得到相应的簇划分结果。例如,以图中所示虚线分割树状图,将得到包含 7 个聚类簇的结果。
将分割层逐步提升,则可得到聚类簇逐渐减小的聚类结果。下图显示 AGNES 算法产生 7 至 4 个聚类簇的划分结果。
首先实现距离计算函数。
def get_dist(XA, XB, type='min'):
if len(XA.shape) == 1:
XA = np.array([XA])
if len(XB.shape) == 1:
XB = np.array([XB])
dist = 0
if type == 'min':
dist = cdist(XA, XB, 'euclidean').min()
elif type == 'max':
dist = cdist(XA, XB, 'euclidean').max()
else:
dist = cdist(XA, XB, 'euclidean').sum() / XA.shape[0] / XB.shape[0]
return dist
然后编写 AGNES 函数的实体:
def AGNES(dataset, k, dist_method='avg'):
# 获取样本集长度
length = dataset.shape[0]
# 初始化聚类簇和距离矩阵
clusters, dist_matrix = [], np.mat(np.zeros((length, length)))
for data in dataset:
clusters.append(data)
for i in range(length):
for j in range(length):
if i == j:
dist = np.inf
else:
dist = get_dist(clusters[i], clusters[j], dist_method)
dist_matrix[i, j] = dist
dist_matrix[j, i] = dist
cluster_count = length
while cluster_count > k:
first, second = np.where(dist_matrix == dist_matrix.min())[0]
clusters[first] = np.vstack((cluters[first], clusters[second]))
for i in range(second + 1, cluster_count):
clusters[i - 1] = clusters[i]
clusters.pop()
dist_matrix = np.delete(dist_matrix, second, axis=0)
dist_matrix = np.delete(dist_matrix, second, axis=1)
for i in range(cluster_count - 1):
if first == i:
dist = np.inf
else:
dist = get_dist(clusters[first], clusters[i], dist_method)
dist_matrix[first, i] = dist
dist_matrix[i, first] = dist
cluster_count -= 1
return clusters
【完整代码】:传送门
def AGNES(dataset, k, dist_method='avg'): length = dataset.shape[0] clusters = [] dist_matrix = np.mat(np.zeros((length, length))) for data in dataset: clusters.append(data) for i in range(length): for j in range(length): if i == j: dist = np.inf else: dist = get_dist(clusters[i], clusters[j], dist_method) dist_matrix[i, j] = dist dist_matrix[j, i] = dist # 设置当前聚类簇的个数 cluster_count = length while cluster_count > k: # 找出距离最近的两个聚类簇 first, second = np.where(dist_matrix == dist_matrix.min())[0] # 合并这两个聚类簇 clusters[first] = np.vstack((clusters[first], clusters[second])) # 重新编号聚类簇 for i in range(second + 1, cluster_count): clusters[i - 1] = clusters[i] clusters.pop() # 删除距离矩阵的第 second 行与列 dist_matrix = np.delete(dist_matrix, second, axis=0) dist_matrix = np.delete(dist_matrix, second, axis=1) # 重新计算距离矩阵第 first 簇与其他簇之间距离 for i in range(cluster_count - 1): if first == i: dist = np.inf else: dist = get_dist(clusters[first], clusters[i], dist_method) dist_matrix[first, i] = dist dist_matrix[i, first] = dist cluster_count -= 1 return clusters
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。