What is the difference between clustering and classification (discriminant analysis)? Clustering is unsupervised: samples are grouped purely by their mutual similarity, with no class labels given in advance, whereas classification/discriminant analysis learns a decision rule from samples whose labels are known.
Definition of distance
Commonly used distances
– Euclidean distance (euclidean): distance in the usual sense
– Mahalanobis distance (mahalanobis): accounts for the correlations between variables and is independent of the variables' units
– Cosine distance (cosine): measures the similarity between two vectors
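A minimal sketch of these three distances using scipy.spatial.distance; the sample vectors and the small data set used to estimate the covariance matrix for the Mahalanobis distance are made up for illustration.

import numpy as np
from scipy.spatial.distance import euclidean, mahalanobis, cosine

# two illustrative sample vectors (made up for this example)
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 1.0])

print(euclidean(x, y))            # ordinary straight-line distance

# Mahalanobis needs the inverse covariance matrix of the data;
# here it is estimated from a small made-up sample.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, 4.0, 1.0],
                 [3.0, 1.0, 2.0],
                 [0.0, 3.0, 4.0]])
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(mahalanobis(x, y, VI))      # unit-free, accounts for correlations

print(cosine(x, y))               # 1 - cosine similarity of the two vectors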
Basic idea (agglomerative hierarchical clustering)
1 At the start, every sample forms its own cluster
2 Choose a measure for the distance between samples and between clusters, and compute it
3 Merge the two closest clusters into a new cluster
4 Repeat steps 2-3, always merging the two nearest clusters; each step reduces the number of clusters by one, until all samples are merged into a single cluster (a code sketch follows the linkage list below)
– Ward's method (minimum increase of within-cluster sum of squares): ward
– Average linkage: average
– Complete (maximum-distance) linkage: complete
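A minimal sketch of the bottom-up merging with the three linkage criteria above, using scipy.cluster.hierarchy rather than scikit-learn; the toy data is made up.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# a small made-up 2-D data set
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

for method in ('ward', 'average', 'complete'):
    Z = linkage(X, method=method)                     # pairwise merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion='maxclust')   # cut the tree into 2 clusters
    print(method, labels)

# visualize the merge order for Ward's method
dendrogram(linkage(X, method='ward'))
plt.show()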
AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=)
Attributes
– labels_
– n_leaves_
– n_components_
– children_
Methods
– fit
– fit_predict
– get_params
– set_params
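A short usage sketch of this estimator; the toy data is made up, and in newer scikit-learn releases some constructor arguments shown above (e.g. affinity, n_components) have been renamed or removed.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# made-up 2-D data: two loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [8, 7], [8, 9], [9, 8]], dtype=float)

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)        # fit and return the cluster label of each sample

print(labels)                        # e.g. [1 1 1 0 0 0]
print(model.n_leaves_)               # number of leaves (= number of samples)
print(model.children_[:3])           # first few merge steps of the tree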
K-Means algorithm:
1 Choose K points as the initial centroids
2 Assign each point to its nearest centroid, forming K clusters
3 Recompute the centroid of each cluster
4 Repeat steps 2-3 until the centroids no longer change (see the sketch after this list)
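A minimal NumPy sketch of these four steps (Lloyd's algorithm); the data, K, and the helper name simple_kmeans are made up for illustration, and the KMeans estimator below is what one would use in practice.

import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick K data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each cluster's centroid (empty clusters are not handled in this toy)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 5])  # made-up data
labels, centroids = simple_kmeans(X, k=2)
print(centroids)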
KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1)
Attributes
– cluster_centers_
– labels_
– inertia_
Methods
– fit
– fit_predict
– predict
– get_params
– set_params
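A short usage sketch of KMeans on made-up data, showing the attributes and methods listed above.

import numpy as np
from sklearn.cluster import KMeans

# made-up data: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)            # coordinates of the 2 centroids
print(km.inertia_)                    # sum of squared distances to the nearest centroid
print(km.predict([[0, 0], [5, 5]]))   # assign new points to the learned clusters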
Strengths
– Efficient, and not strongly affected by the choice of initial values
Limitations
– Cannot handle non-spherical clusters (illustrated in the sketch after this list)
– Cannot handle clusters of different sizes or different densities
– Outliers can distort the result considerably (so remove them first)
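A small illustration of the non-spherical-cluster limitation, using scikit-learn's make_moons toy generator (the parameters are made up): K-Means splits the two crescents with a roughly straight boundary instead of following their shape.

from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# two crescent-shaped clusters that are clearly not spherical
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# agreement with the true crescents is poor because the clusters are not spherical
print("ARI of K-Means on two moons: %.3f" % adjusted_rand_score(y_true, labels))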
DBSCAN = Density-Based Spatial Clustering of Applications with Noise
The algorithm groups regions of sufficiently high density into clusters and can discover clusters of arbitrary shape.
r-neighborhood: the region within radius r of a given point
Core point: a point whose r-neighborhood contains at least a minimum number M of points
Directly density-reachable: if point p lies inside the r-neighborhood of a core point q, then p is directly density-reachable from q
If there is a chain of points p1, p2, …, pn with p1 = q and pn = p, such that each pi+1 is directly density-reachable from pi with respect to r and M, then p is density-reachable from q with respect to r and M
If there is a point o in the sample set D such that both p and q are density-reachable from o with respect to r and M, then p and q are density-connected with respect to r and M
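A minimal sketch of the core-point definition, using scikit-learn's NearestNeighbors for the r-neighborhood query; the data, r, and M are made up.

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],
              [5, 5], [5.1, 5.2], [9, 9]])   # made-up data
r, M = 0.5, 3                                # neighborhood radius and minimum number of points

nn = NearestNeighbors(radius=r).fit(X)
neighborhoods = nn.radius_neighbors(X, return_distance=False)  # indices inside each r-neighborhood

# a point is a core point if its r-neighborhood (including itself) holds at least M points
is_core = np.array([len(nbrs) >= M for nbrs in neighborhoods])
print(is_core)   # e.g. [ True  True  True False False False]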
Basic idea of the algorithm
1 Choose suitable values of r and M
2 Scan all sample points; if the r-neighborhood of a point p contains more than M points, create a new cluster with p as its core point
3 Repeatedly collect the points that are directly density-reachable (and, transitively, density-reachable) from these core points and add them to the corresponding clusters; merge clusters whose core points turn out to be density-connected
4 The algorithm ends when no new point can be added to any cluster
Input: a database containing n objects, radius e, minimum number of points MinPts; Output: all clusters that meet the density requirement.
(1) Repeat
(2)   Take an unprocessed point from the database;
(3)   If the point is a core point, then find all objects density-reachable from it and form a cluster;
(4)   Else the point is a border point (not a core object); skip it and continue with the next point;
(5) Until all points have been processed.
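A compact plain-Python/NumPy sketch that follows this pseudocode (a teaching toy, not the library implementation; the helper name simple_dbscan, the data, and the e/MinPts values are made up).

import numpy as np

def simple_dbscan(X, eps, min_pts):
    n = len(X)
    # precompute each point's eps-neighborhood
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighborhoods = [np.where(dists[i] <= eps)[0] for i in range(n)]
    labels = np.full(n, -1)           # -1 = noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or len(neighborhoods[i]) < min_pts:
            continue                  # already processed, or not a core point
        # start a new cluster from core point i and expand it
        labels[i] = cluster_id
        queue = list(neighborhoods[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:       # density-reachable point: add it to the cluster
                labels[j] = cluster_id
                if len(neighborhoods[j]) >= min_pts:   # j is also a core point
                    queue.extend(neighborhoods[j])     # keep expanding from it
        cluster_id += 1
    return labels

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],
              [5, 5], [5.1, 5.2], [5.2, 4.9], [9, 9]])
print(simple_dbscan(X, eps=0.5, min_pts=3))   # e.g. [0 0 0 1 1 1 -1], the last point stays noise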
DBSCAN is sensitive to its user-defined parameters: small changes can lead to very different results, and there is no general rule for choosing them, so in practice they are set by experience.
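One common rule of thumb (an addition here, not from the original notes) is to sort each point's distance to its k-th nearest neighbor and place eps near the "elbow" of that curve; a sketch with made-up data follows.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)  # made-up data

k = 5  # typically min_samples, or min_samples - 1
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])      # each point's distance to its k-th neighbor, sorted

plt.plot(k_dist)                    # eps is often chosen near the elbow of this curve
plt.xlabel('points sorted by k-distance')
plt.ylabel('distance to %d-th nearest neighbor' % k)
plt.show()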
DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
Attributes
– core_sample_indices_
– components_
– labels_
Methods
– fit
– fit_predict
– get_params
– set_params
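A short usage sketch showing these attributes on made-up data (the fuller DBSCAN demo appears in the notebook code below).

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[0, 0], [0.2, 0.1], [0.1, 0.3],
              [5, 5], [5.1, 5.2], [5.2, 4.9], [9, 9]])  # made-up data

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

print(db.labels_)                 # cluster label per sample, -1 means noise
print(db.core_sample_indices_)    # indices of the core samples
print(db.components_)             # coordinates of the core samples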
# coding: utf-8

# In[1]:
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt
from scipy import ndimage
from sklearn import manifold, datasets

# In[2]:
# load the handwritten digits data set (64 features per 8x8 image)
digits = datasets.load_digits(n_class=10)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
print(X[:5, :])
print(n_samples, n_features)

# In[3]:
# Visualize the clustering
def plot_clustering(X_red, X, labels, title=None):
    x_min, x_max = np.min(X_red, axis=0), np.max(X_red, axis=0)
    X_red = (X_red - x_min) / (x_max - x_min)
    plt.figure(figsize=(6, 4))
    for i in range(X_red.shape[0]):
        plt.text(X_red[i, 0], X_red[i, 1], str(y[i]),
                 color=plt.cm.nipy_spectral(labels[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})
    plt.xticks([])
    plt.yticks([])
    if title is not None:
        plt.title(title, size=17)
    plt.axis('off')
    plt.tight_layout()

# In[ ]:
# 2D embedding of the digits dataset
print("Computing embedding")
X_red = manifold.SpectralEmbedding(n_components=2).fit_transform(X)
print("Done.")

from sklearn.cluster import AgglomerativeClustering

# agglomerative clustering of the embedded digits with the three linkage criteria
for linkage in ('ward', 'average', 'complete'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
    clustering.fit(X_red)
    plot_clustering(X_red, X, clustering.labels_, "%s linkage" % linkage)

plt.show()

# In[3]:
get_ipython().magic(u'matplotlib inline')
X0 = np.array([7, 5, 7, 3, 4, 1, 0, 2, 8, 6, 5, 3])
X1 = np.array([5, 7, 7, 3, 6, 4, 0, 2, 7, 8, 5, 7])
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0, X1, 'k.');

# In[4]:
# a first manual K-means iteration: assignment around two guessed centroids
C1 = [1, 4, 5, 9, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(4, 6, 'rx', ms=12.0)
plt.plot(5, 5, 'g.', ms=12.0);

# In[5]:
# second iteration: reassign points and move the centroids
C1 = [1, 2, 4, 8, 9, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(3.8, 6.4, 'rx', ms=12.0)
plt.plot(4.57, 4.14, 'g.', ms=12.0);

# In[6]:
# third iteration: assignments and centroids after convergence
C1 = [0, 1, 2, 4, 8, 9, 10, 11]
C2 = list(set(range(12)) - set(C1))
X0C1, X1C1 = X0[C1], X1[C1]
X0C2, X1C2 = X0[C2], X1[C2]
plt.figure()
plt.axis([-1, 9, -1, 9])
plt.grid(True)
plt.plot(X0C1, X1C1, 'rx')
plt.plot(X0C2, X1C2, 'g.')
plt.plot(5.5, 7.0, 'rx', ms=12.0)
plt.plot(2.2, 2.8, 'g.', ms=12.0);

# In[7]:
# two well-separated random blobs
cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
X = np.hstack((cluster1, cluster2)).T
plt.figure()
plt.axis([0, 5, 0, 5])
plt.grid(True)
plt.plot(X[:, 0], X[:, 1], 'k.');

# In[8]:
# elbow method: mean distortion against the number of clusters k
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('The average degree of distortion')
plt.title('Best k')

# In[9]:
x1 = np.array([1, 2, 3, 1, 5, 6, 5, 5, 6, 7, 8, 9, 7, 9])
x2 = np.array([1, 3, 2, 2, 8, 6, 7, 6, 7, 1, 2, 1, 1, 3])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)
plt.figure()
plt.axis([0, 10, 0, 10])
plt.grid(True)
plt.plot(X[:, 0], X[:, 1], 'k.');

# In[10]:
# elbow method on the second data set
K = range(1, 10)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(X)
    meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
plt.plot(K, meandistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('The average degree of distortion')
plt.title('Best K')

# In[11]:
"""
===================================
Demo of DBSCAN clustering algorithm
===================================
Finds core samples of high density and expands clusters from them.
"""
print(__doc__)

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

##############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

##############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    # core samples drawn large, border samples small
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

# In[ ]: