
Clustering Algorithms: BIRCH in Detail

1 Theory

1.1 B-trees

(1) m-way search trees

An m-way search tree is either an empty tree or a tree satisfying the following properties:

  • The root has at most m subtrees and has the structure

    n, A_{0}, (K_{1}, A_{1}), (K_{2}, A_{2}), \ldots, (K_{n}, A_{n})

    where each A_{i} is a pointer to a subtree, each K_{i} is a key, and K_{i} < K_{i+1}.

  • All keys in subtree A_{i} (for 1 \le i < n) are greater than K_{i} and less than K_{i+1}.
  • All keys in subtree A_{0} are less than K_{1}.
  • All keys in subtree A_{n} are greater than K_{n}.
  • Each subtree A_{i} is itself an m-way search tree (a minimal node sketch follows this list).
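To make the node layout concrete, here is a minimal sketch (not from the original article; class and function names are hypothetical) of an m-way search-tree node and its lookup routine:

import dataclasses

@dataclasses.dataclass
class MWayNode:
    # keys:     [K_1, ..., K_n], strictly increasing
    # children: [A_0, A_1, ..., A_n]; None stands for an empty subtree
    keys: list
    children: list

def search(node, key):
    """Return True if key is stored in the subtree rooted at node."""
    if node is None:
        return False
    for i, k in enumerate(node.keys):
        if key == k:
            return True
        if key < k:                          # keys in children[i] lie below keys[i]
            return search(node.children[i], key)
    return search(node.children[-1], key)    # key > K_n: descend into A_n

leaf = MWayNode([7], [None, None])
root = MWayNode([5, 10], [None, leaf, None])
print(search(root, 7))   # True
print(search(root, 6))   # False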

(2) B-trees

A B-tree of order m is an m-way search tree that is either empty or satisfies the following properties:

  • Every node has at most m subtrees.
  • The root has at least two subtrees.
  • Every non-terminal node other than the root has at least \left\lceil m/2 \right\rceil subtrees.
  • All leaf nodes are on the same level.

1.2 Steps

        For a step-by-step simulation of the process, see: https://www.cnblogs.com/pinard/p/6179132.html


BIRCH recognizes that the points of a data set are usually unevenly distributed: it clusters the points lying in dense regions and removes, as outliers, the points lying in sparse regions. It is also an incremental method: the clustering decision for each point is based only on the data processed so far, not on the full data set.

① Build the clustering feature (CF) tree
First, all data is scanned once and an initial in-memory CF tree is built within the given memory budget (and recycle space on disk), summarizing the clustering information of the data set. Points in dense regions are grouped into fine-grained subclusters, while points in sparse regions are removed as outliers.

② Condense the CF tree (optional)
This optional phase is the bridge between phases ① and ③. Similar to phase ①, it rescans the leaf entries of the initial CF tree to build a smaller tree, removing more outliers and merging crowded subclusters into larger ones.

③ Global clustering
A global or semi-global clustering algorithm is applied to all the leaf entries. A clustering algorithm designed for data points adapts easily to a set of subclusters, each represented by its CF vector: the centroid of each subcluster is computed, and each subcluster is then treated as a single point located at that centroid. This phase captures the major distribution patterns in the data.

④ Cluster refinement
Phase ③ gives only a coarse summary, because the original data has been scanned just once; the clusters can therefore be refined further. Using the centroids produced in the previous phase as seeds, the data points are redistributed to their nearest seed to obtain a new set of clusters. This allows points to migrate between subclusters and ensures that all copies of a given data point end up in the same subcluster. It also provides an option to discard outliers: a point that is too far from its nearest seed can be treated as an outlier and excluded from the result.

2 Parameters

Class: sklearn.cluster.Birch

Parameters:

  • threshold : (float, default=0.5) the radius of the subcluster obtained by merging a new sample with its closest subcluster must be smaller than this threshold; otherwise a new subcluster is started (and the node may split).
  • branching_factor : (int, default=50) maximum number of CF subclusters in each node.
  • n_clusters : (int, sklearn.cluster estimator, or None, default=3) number of clusters for the final clustering step. If None, the final clustering step is not performed and the subclusters are returned as they are; if an sklearn.cluster estimator instance, that model is fit treating the subclusters as new samples.
  • compute_labels : (bool, default=True) whether to compute labels for each fit.
  • copy : (bool, default=True) whether to make a copy of the given data; if set to False, the initial data will be overwritten.

Attributes:

  • root_ : the root of the CF tree.
  • dummy_leaf_ : start pointer to the chain of all leaf nodes.
  • subcluster_centers_ : centroids of the subclusters read off all the leaves.
  • subcluster_labels_ : labels assigned to the subcluster centroids after global clustering.
  • labels_ : labels of all the input data.

3 Worked Example

For a full comparison, see the scikit-learn example: https://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html#sphx-glr-auto-examples-cluster-plot-birch-vs-minibatchkmeans-py
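Beyond that example, a minimal self-contained usage sketch (synthetic data via make_blobs; the parameter values are illustrative, not tuned recommendations):

import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

# Synthetic data: three Gaussian blobs.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=42)

# Phase 1 builds the CF tree; the global step then groups the leaf
# subclusters into n_clusters final clusters.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)

print(labels[:10])                      # final cluster label per sample
print(model.subcluster_centers_.shape)  # centroids of the leaf subclusters

# With n_clusters=None the global step is skipped and every leaf
# subcluster is returned as its own cluster.
coarse = Birch(threshold=0.5, n_clusters=None).fit(X)
print(len(coarse.subcluster_centers_))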

4 Source Code Walkthrough

The source lives at Anaconda3/Lib/site-packages/sklearn/cluster/birch.py.

(1) Prerequisites

  • hasattr() checks whether a class instance has an attribute or method with the given name, returning True or False.

hasattr(obj, name), where obj is a class instance and name is the attribute or method name.

  • getattr() retrieves the value of the named attribute from a class instance.

getattr(obj, name[, default]), where obj is the class instance, name is the attribute name, and default is an optional fallback: if the lookup fails and no default is given, an AttributeError is raised; otherwise the default is returned.

  • setattr() is the most versatile of the three: its basic use is to modify the value of an instance attribute, but it can also dynamically add attributes or methods to an instance.
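A quick demonstration of all three built-ins (plain Python, unrelated to BIRCH itself):

class Point:
    def __init__(self):
        self.x = 1

p = Point()
print(hasattr(p, 'x'))      # True
print(hasattr(p, 'y'))      # False
print(getattr(p, 'x'))      # 1
print(getattr(p, 'y', 0))   # 0 -- the default avoids AttributeError
setattr(p, 'y', 2)          # dynamically add attribute y
print(p.y)                  # 2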

(2) The Birch class

  • Birch(BaseEstimator, TransformerMixin, ClusterMixin) is built on the mixin base classes from sklearn's base module.
  • The fit method (the core computation lives in _fit):
def fit(self, X, y=None):
    """
    Build a CF Tree for the input data.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        Input data.
    y : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    self
        Fitted estimator.
    """
    self.fit_, self.partial_fit_ = True, False
    return self._fit(X)

def _fit(self, X):
    X = self._validate_data(X, accept_sparse='csr', copy=self.copy)
    threshold = self.threshold
    branching_factor = self.branching_factor

    if branching_factor <= 1:
        raise ValueError("Branching_factor should be greater than one.")
    n_samples, n_features = X.shape

    # If partial_fit is called for the first time or fit is called, we
    # start a new tree.
    partial_fit = getattr(self, 'partial_fit_')
    has_root = getattr(self, 'root_', None)
    if getattr(self, 'fit_') or (partial_fit and not has_root):
        # The first root is the leaf. Manipulate this object throughout.
        self.root_ = _CFNode(threshold=threshold,
                             branching_factor=branching_factor,
                             is_leaf=True,
                             n_features=n_features)

        # To enable getting back subclusters.
        self.dummy_leaf_ = _CFNode(threshold=threshold,
                                   branching_factor=branching_factor,
                                   is_leaf=True, n_features=n_features)
        self.dummy_leaf_.next_leaf_ = self.root_
        self.root_.prev_leaf_ = self.dummy_leaf_

    # Cannot vectorize. Enough to convince to use cython.
    if not sparse.issparse(X):
        iter_func = iter
    else:
        iter_func = _iterate_sparse_X

    # Iterate over the data, inserting one subcluster per sample.
    for sample in iter_func(X):
        subcluster = _CFSubcluster(linear_sum=sample)
        split = self.root_.insert_cf_subcluster(subcluster)

        # If the root must split, split it into two subclusters and
        # attach both to a new (non-leaf) root.
        if split:
            new_subcluster1, new_subcluster2 = _split_node(
                self.root_, threshold, branching_factor)
            del self.root_
            self.root_ = _CFNode(threshold=threshold,
                                 branching_factor=branching_factor,
                                 is_leaf=False,
                                 n_features=n_features)
            self.root_.append_subcluster(new_subcluster1)
            self.root_.append_subcluster(new_subcluster2)

    # Gather the centroids of all leaf subclusters.
    centroids = np.concatenate([
        leaf.centroids_ for leaf in self._get_leaves()])
    self.subcluster_centers_ = centroids

    self._global_clustering(X)
    return self
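Note that _fit serves both fit and partial_fit: the fit_/partial_fit_ flags decide whether a fresh tree is started or new samples are inserted into the existing one. A small sketch of incremental fitting (chunk sizes are arbitrary):

import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
model = Birch(n_clusters=None)

# Each call inserts the chunk's samples into the same CF tree; only the
# first call (or a call to fit) starts a new tree.
for chunk in np.array_split(rng.randn(600, 2), 3):
    model.partial_fit(chunk)

print(len(model.subcluster_centers_))  # leaf subclusters accumulated so far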

Other helper functions:

Iterating over a sparse matrix, yielding one densified row at a time:

def _iterate_sparse_X(X):
    """This little hack returns a densified row when iterating over a sparse
    matrix, instead of constructing a sparse matrix for every row that is
    expensive.
    """
    n_samples = X.shape[0]
    X_indices = X.indices
    X_data = X.data
    X_indptr = X.indptr

    for i in range(n_samples):
        row = np.zeros(X.shape[1])
        startptr, endptr = X_indptr[i], X_indptr[i + 1]
        nonzero_indices = X_indices[startptr:endptr]
        row[nonzero_indices] = X_data[startptr:endptr]
        yield row
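The generator walks the CSR buffers (indptr, indices, data) directly instead of constructing a sparse matrix per row. A quick check, assuming the _iterate_sparse_X definition above is in scope (the example matrix is made up):

import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0., 2., 0.],
                                [1., 0., 3.]]))
for row in _iterate_sparse_X(X):
    print(row)
# [0. 2. 0.]
# [1. 0. 3.]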

The node-splitting function: two new subclusters and two new CF nodes are created, and each new node is attached as the child of one of the subclusters. If the node being split is a leaf, the leaf-chain pointers are rewired. The pairwise squared distances between the node's subcluster centroids are computed, the two farthest subclusters are chosen as seeds, and every subcluster is then assigned to whichever seed is closer.

def _split_node(node, threshold, branching_factor):
    """The node has to be split if there is no place for a new subcluster
    in the node.
    1. Two empty nodes and two empty subclusters are initialized.
    2. The pair of distant subclusters are found.
    3. The properties of the empty subclusters and nodes are updated
       according to the nearest distance between the subclusters to the
       pair of distant subclusters.
    4. The two nodes are set as children to the two subclusters.
    """
    new_subcluster1 = _CFSubcluster()
    new_subcluster2 = _CFSubcluster()
    new_node1 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_node2 = _CFNode(
        threshold=threshold, branching_factor=branching_factor,
        is_leaf=node.is_leaf,
        n_features=node.n_features)
    new_subcluster1.child_ = new_node1
    new_subcluster2.child_ = new_node2

    # Splice the two new nodes into the doubly linked list of leaves.
    if node.is_leaf:
        if node.prev_leaf_ is not None:
            node.prev_leaf_.next_leaf_ = new_node1
        new_node1.prev_leaf_ = node.prev_leaf_
        new_node1.next_leaf_ = new_node2
        new_node2.prev_leaf_ = new_node1
        new_node2.next_leaf_ = node.next_leaf_
        if node.next_leaf_ is not None:
            node.next_leaf_.prev_leaf_ = new_node2

    # Pairwise squared distances between the node's subcluster centroids.
    dist = euclidean_distances(
        node.centroids_, Y_norm_squared=node.squared_norm_, squared=True)
    n_clusters = dist.shape[0]

    # Pick the two farthest subclusters as seeds ...
    farthest_idx = np.unravel_index(
        dist.argmax(), (n_clusters, n_clusters))
    node1_dist, node2_dist = dist[(farthest_idx,)]

    # ... and assign every subcluster to the closer seed.
    node1_closer = node1_dist < node2_dist
    for idx, subcluster in enumerate(node.subclusters_):
        if node1_closer[idx]:
            new_node1.append_subcluster(subcluster)
            new_subcluster1.update(subcluster)
        else:
            new_node2.append_subcluster(subcluster)
            new_subcluster2.update(subcluster)
    return new_subcluster1, new_subcluster2
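The farthest-pair selection can be checked in isolation: np.unravel_index converts the flat argmax of the distance matrix into an (i, j) pair, and dist[(farthest_idx,)] pulls out rows i and j at once (the toy centroids below are illustrative):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

centroids = np.array([[0., 0.], [1., 0.], [10., 0.]])
dist = euclidean_distances(centroids, squared=True)

farthest_idx = np.unravel_index(dist.argmax(), dist.shape)
print(farthest_idx)              # (0, 2): the two most distant centroids
node1_dist, node2_dist = dist[(farthest_idx,)]
print(node1_dist < node2_dist)   # [ True  True False]: subclusters 0 and 1
                                 # go to seed 0, subcluster 2 to seed 2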

Retrieving the leaf nodes:

def _get_leaves(self):
    """
    Retrieve the leaves of the CF Node.

    Returns
    -------
    leaves : list of shape (n_leaves,)
        List of the leaf nodes.
    """
    leaf_ptr = self.dummy_leaf_.next_leaf_
    leaves = []
    while leaf_ptr is not None:
        leaves.append(leaf_ptr)
        leaf_ptr = leaf_ptr.next_leaf_
    return leaves

Performing the global clustering: by default this brings in AgglomerativeClustering (covered in a separate article).

def _global_clustering(self, X=None):
    """
    Global clustering for the subclusters obtained after fitting
    """
    clusterer = self.n_clusters
    centroids = self.subcluster_centers_
    compute_labels = (X is not None) and self.compute_labels

    # Preprocessing for the global clustering.
    not_enough_centroids = False
    if isinstance(clusterer, numbers.Integral):
        clusterer = AgglomerativeClustering(
            n_clusters=self.n_clusters)
        # There is no need to perform the global clustering step.
        if len(centroids) < self.n_clusters:
            not_enough_centroids = True
    elif (clusterer is not None and not
          hasattr(clusterer, 'fit_predict')):
        raise ValueError("n_clusters should be an instance of "
                         "ClusterMixin or an int")

    # To use in predict to avoid recalculation.
    self._subcluster_norms = row_norms(
        self.subcluster_centers_, squared=True)

    if clusterer is None or not_enough_centroids:
        self.subcluster_labels_ = np.arange(len(centroids))
        if not_enough_centroids:
            warnings.warn(
                "Number of subclusters found (%d) by Birch is less "
                "than (%d). Decrease the threshold."
                % (len(centroids), self.n_clusters), ConvergenceWarning)
    else:
        # The global clustering step that clusters the subclusters of
        # the leaves. It assumes the centroids of the subclusters as
        # samples and finds the final centroids.
        self.subcluster_labels_ = clusterer.fit_predict(
            self.subcluster_centers_)

    if compute_labels:
        self.labels_ = self.predict(X)
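As the branching above shows, n_clusters may be None (skip the global step), an int (wrapped in AgglomerativeClustering), or any estimator exposing fit_predict. A sketch of the third option (MiniBatchKMeans is chosen arbitrarily here; the data is synthetic):

from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# Any clusterer with fit_predict can run the global step on the
# leaf-subcluster centroids.
model = Birch(threshold=0.5,
              n_clusters=MiniBatchKMeans(n_clusters=3, random_state=0))
model.fit(X)
print(model.subcluster_labels_)   # one label per leaf subcluster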

 

(3) _CFNode

Parameters:

  • threshold : float — radius threshold used when merging into a subcluster.
  • branching_factor : int — the branching factor.
  • is_leaf : bool — whether the node is a leaf.
  • n_features : int — number of features.

Attributes:

  • subclusters_ : list — the subclusters held by this node.
  • prev_leaf_ : _CFNode — the previous leaf node.
  • next_leaf_ : _CFNode — the next leaf node.
  • init_centroids_ — pre-allocated centroids, shape=(branching_factor + 1, n_features).
  • init_sq_norm_ — pre-allocated squared norms, shape=(branching_factor + 1,).
  • centroids_ — view of init_centroids_ restricted to the current subclusters.
  • squared_norm_ — view of init_sq_norm_ restricted to the current subclusters.

 

_CFNode is made up of three methods:

The first method, append_subcluster(self, subcluster), appends a subcluster and updates the node's CF features:

def append_subcluster(self, subcluster):
    # Number of subclusters currently held by this node.
    n_samples = len(self.subclusters_)
    # Add the new subcluster to the node.
    self.subclusters_.append(subcluster)
    # Record the new subcluster's centroid and squared norm in the
    # pre-allocated arrays.
    self.init_centroids_[n_samples] = subcluster.centroid_
    self.init_sq_norm_[n_samples] = subcluster.sq_norm_

    # Keep centroids and squared norm as views. In this way
    # if we change init_centroids and init_sq_norm_, it is
    # sufficient.
    self.centroids_ = self.init_centroids_[:n_samples + 1, :]
    self.squared_norm_ = self.init_sq_norm_[:n_samples + 1]
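The "views" comment is worth unpacking: centroids_ and squared_norm_ are NumPy views into the pre-allocated init_ arrays, so writing through init_centroids_ is immediately visible via centroids_ with no copying. A minimal demonstration of the mechanism (array sizes are arbitrary):

import numpy as np

init_centroids = np.zeros((4, 2))  # pre-allocated (branching_factor + 1, n_features)
centroids = init_centroids[:2, :]  # a slice is a view, not a copy

init_centroids[0] = [3., 4.]       # write through the backing array ...
print(centroids[0])                # ... [3. 4.]: the view sees the change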

The second method, update_split_subclusters(self, subcluster, new_subcluster1, new_subcluster2), replaces a split subcluster with its two halves:

def update_split_subclusters(self, subcluster,
                             new_subcluster1, new_subcluster2):
    """Remove a subcluster from a node and update it with the
    split subclusters.
    """
    ind = self.subclusters_.index(subcluster)
    self.subclusters_[ind] = new_subcluster1
    self.init_centroids_[ind] = new_subcluster1.centroid_
    self.init_sq_norm_[ind] = new_subcluster1.sq_norm_
    self.append_subcluster(new_subcluster2)

The third method, insert_cf_subcluster(self, subcluster), inserts a new subcluster into the node:

def insert_cf_subcluster(self, subcluster):
    """Insert a new subcluster into the node."""
    # If the node holds no subclusters yet, just append.
    if not self.subclusters_:
        self.append_subcluster(subcluster)
        return False

    threshold = self.threshold
    branching_factor = self.branching_factor

    # We need to find the closest subcluster among all the
    # subclusters so that we can insert our new subcluster.
    dist_matrix = np.dot(self.centroids_, subcluster.centroid_)
    dist_matrix *= -2.
    dist_matrix += self.squared_norm_
    closest_index = np.argmin(dist_matrix)
    closest_subcluster = self.subclusters_[closest_index]

    # If the closest subcluster has a child node, we need a recursive
    # strategy: insert into the child, then update the CF on the way back.
    if closest_subcluster.child_ is not None:
        split_child = closest_subcluster.child_.insert_cf_subcluster(
            subcluster)

        if not split_child:
            # If it is determined that the child need not be split, we
            # can just update the closest_subcluster.
            closest_subcluster.update(subcluster)
            self.init_centroids_[closest_index] = \
                self.subclusters_[closest_index].centroid_
            self.init_sq_norm_[closest_index] = \
                self.subclusters_[closest_index].sq_norm_
            return False

        # Things are not too good. We need to redistribute the
        # subclusters in our child node, and add a new subcluster in
        # the parent subcluster to accommodate the new child.
        else:
            new_subcluster1, new_subcluster2 = _split_node(
                closest_subcluster.child_, threshold, branching_factor)
            self.update_split_subclusters(
                closest_subcluster, new_subcluster1, new_subcluster2)

            # If this node now exceeds the branching factor, it must
            # split as well; propagate the signal upwards.
            if len(self.subclusters_) > self.branching_factor:
                return True
            return False

    # Good to go: this is a leaf-level insertion.
    else:
        # Merge when the radius of the merged subcluster stays within
        # the threshold; the CF parameters are then updated in place.
        merged = closest_subcluster.merge_subcluster(
            subcluster, self.threshold)
        if merged:
            self.init_centroids_[closest_index] = \
                closest_subcluster.centroid_
            self.init_sq_norm_[closest_index] = \
                closest_subcluster.sq_norm_
            return False

        # Not close enough to any other subcluster, but we still
        # have space, so add a new one.
        elif len(self.subclusters_) < self.branching_factor:
            self.append_subcluster(subcluster)
            return False

        # We do not have enough space nor is it closer to any
        # other subcluster. We need to split.
        else:
            self.append_subcluster(subcluster)
            return True
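The distance computation at the top of this method is a shortcut worth verifying: since ||x − c||² = ||x||² − 2x·c + ||c||² and ||c||² is the same for every candidate, minimizing −2 X·c + ||X||² yields the same closest subcluster as the full squared Euclidean distance. A quick numerical check (random data, illustrative only):

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.RandomState(0)
centroids = rng.randn(5, 3)              # stands in for self.centroids_
c = rng.randn(3)                         # the new subcluster's centroid
sq_norms = (centroids ** 2).sum(axis=1)  # stands in for self.squared_norm_

partial = -2. * centroids.dot(c) + sq_norms
full = euclidean_distances(centroids, c.reshape(1, -1), squared=True).ravel()

print(np.argmin(partial) == np.argmin(full))  # True: same closest subcluster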

(4) _CFSubcluster

Parameters:

  • linear_sum : ndarray — the sample used to initialize the subcluster.

Attributes:

  • n_samples_ : int — number of samples in the subcluster.
  • linear_sum_ : ndarray — linear sum of all the samples in the subcluster.
  • squared_sum_ : float — sum of the squared l2 norms of all the samples.
  • centroid_ — the centroid.
  • child_ — the child node.
  • sq_norm_ — squared norm of the centroid.

_CFSubcluster is made up of three methods:

The first method, update(self, subcluster), folds another subcluster in, updating the sample count, linear sum, squared sum, centroid, and squared norm:

def update(self, subcluster):
    self.n_samples_ += subcluster.n_samples_
    self.linear_sum_ += subcluster.linear_sum_
    self.squared_sum_ += subcluster.squared_sum_
    self.centroid_ = self.linear_sum_ / self.n_samples_
    self.sq_norm_ = np.dot(self.centroid_, self.centroid_)

The second method, merge_subcluster(self, nominee_cluster, threshold), checks whether merging with the candidate subcluster keeps the radius within the threshold and, if so, merges:

def merge_subcluster(self, nominee_cluster, threshold):
    """Check if a cluster is worthy enough to be merged. If
    yes then merge.
    """
    new_ss = self.squared_sum_ + nominee_cluster.squared_sum_
    new_ls = self.linear_sum_ + nominee_cluster.linear_sum_
    new_n = self.n_samples_ + nominee_cluster.n_samples_
    new_centroid = (1 / new_n) * new_ls
    new_norm = np.dot(new_centroid, new_centroid)
    dot_product = (-2 * new_n) * new_norm
    sq_radius = (new_ss + dot_product) / new_n + new_norm
    if sq_radius <= threshold ** 2:
        (self.n_samples_, self.linear_sum_, self.squared_sum_,
         self.centroid_, self.sq_norm_) = \
            new_n, new_ls, new_ss, new_centroid, new_norm
        return True
    return False
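The quantity sq_radius above is the squared radius of the merged subcluster. The algebra (a derivation added here for clarity, not part of the source): with n samples x_{1}, \ldots, x_{n}, linear sum LS, squared sum SS and centroid c = LS/n,

\[
R^{2} = \frac{1}{n}\sum_{i=1}^{n}\lVert x_{i}-c\rVert^{2}
      = \frac{SS - 2\,c^{\top}LS + n\lVert c\rVert^{2}}{n}
      = \frac{SS}{n} - \lVert c\rVert^{2},
\qquad \text{since } c^{\top}LS = n\lVert c\rVert^{2}.
\]

In the code, dot_product = -2n\lVert c\rVert^{2}, so (new_ss + dot_product)/new_n + new_norm reduces exactly to SS/n - \lVert c\rVert^{2}, and the merge is accepted when R^{2} \le threshold^{2}. The radius method below returns the square root of the same quantity.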

The third method, radius (a @property in the scikit-learn source), computes the radius of the subcluster:

@property
def radius(self):
    """Return radius of the subcluster"""
    dot_product = -2 * np.dot(self.linear_sum_, self.centroid_)
    return sqrt(
        ((self.squared_sum_ + dot_product) / self.n_samples_) +
        self.sq_norm_)

 
