every blog every motto: You can do more than you think.
https://blog.csdn.net/weixin_39190382?type=blog
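All of a sudden k-means doesn't look so great anymore, haha.
Note: this algorithm not only clusters the data but also flags outliers. After clustering, points labeled -1 are noise points (outliers) and can simply be dropped.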
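To make the "label -1 means noise" point concrete, here is a minimal sketch of dropping noise points after clustering. The toy data and the `eps` / `min_samples` values are made up for illustration; it assumes scikit-learn's `DBSCAN` class, which uses the same -1 convention as the `dbscan` function used later in this post.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one far-away point (made-up illustrative data).
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
    [20.0, 20.0],  # isolated point, expected to be flagged as noise
])

# eps and min_samples are illustrative values chosen for this toy data.
labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

# Keep only the points that were assigned to some cluster.
mask = labels != -1
X_clean = X[mask]
labels_clean = labels[mask]

print("labels:", labels)
print("points kept:", len(X_clean), "of", len(X))
```

The boolean mask keeps the rows of `X` and the labels aligned, so `X_clean` and `labels_clean` can be passed on to later steps together.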
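DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based spatial clustering algorithm. It partitions data into clusters according to density and can find clusters of arbitrary shape in noisy data.
Imagine a crowd of people in which someone wants to build a direct-sales network (the "micro-business" kind run over WeChat). A person only counts as a successful seller if they recruit among their immediate relatives and those recruits number more than 5. Under this rule the crowd splits naturally into separate "sales groups", which is exactly clustering. (Short and sweet, right?)
As shown in the figure below, if a point has at least a given number of other points (the "more than 5" above, which sets the density) within a given radius (the "immediate relatives" above, which decides how close counts as close), it is called a core point.
Core points keep recruiting "downlines" with the same radius and count until the rule can no longer be satisfied. The last downlines, which lie within a core point's radius but do not have enough neighbors of their own, are called border points; points that belong to no cluster are noise points.
The figure below helps to visualize this: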
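The core / border / noise rule described above can be sketched from scratch in a few lines of NumPy. This is a didactic O(n²) illustration, not how scikit-learn implements it; the function name `simple_dbscan` and the toy data are hypothetical. Following scikit-learn's convention, a point counts as its own neighbor when checking `min_samples`.

```python
import numpy as np

def simple_dbscan(X, eps, min_samples):
    """Return (labels, is_core); labels are 0, 1, ... or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; each point is within eps of itself.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unassigned core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    # Only core points keep expanding; border points stop here.
                    if is_core[q]:
                        stack.append(q)
        cluster_id += 1
    return labels, is_core

# Four points in a small square, one point hanging off its edge, one far away.
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.25, 0.05], [3, 3]])
labels, is_core = simple_dbscan(X, eps=0.17, min_samples=4)

is_border = (labels != -1) & ~is_core
print("labels:   ", labels)
print("is_core:  ", is_core)
print("is_border:", is_border)
```

In this toy data the four square points are core points, the fifth point is reachable from two of them but has too few neighbors of its own (a border point), and the last point is noise.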
Note: run the code below in a Jupyter notebook.
Generate sample points
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import numpy as np
import pandas as pd
from sklearn import datasets
X,_ = datasets.make_moons(500,noise = 0.1,random_state=1)
df = pd.DataFrame(X,columns = ['feature1','feature2'])
df.plot.scatter('feature1','feature2', s = 100,alpha = 0.6, title = 'dataset by make_moon');
Clustering
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
from sklearn.cluster import dbscan
# eps is the neighborhood radius; min_samples is the minimum number of points in that neighborhood
core_samples,cluster_ids = dbscan(X, eps = 0.2, min_samples=20)
# a value of -1 in cluster_ids marks the corresponding point as noise
df = pd.DataFrame(np.c_[X,cluster_ids],columns = ['feature1','feature2','cluster_id'])
df['cluster_id'] = df['cluster_id'].astype('i2')
df.plot.scatter('feature1','feature2', s = 100,
    c = list(df['cluster_id']),cmap = 'rainbow',colorbar = False,
    alpha = 0.6,title = 'sklearn DBSCAN cluster result');
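Besides plotting, it can help to summarize the result numerically. The sketch below reruns the same `dbscan` call on the same `make_moons` data and counts clusters, noise points, and core points; the variable names `is_noise`, `n_clusters`, and `sizes` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import dbscan

X, _ = datasets.make_moons(500, noise=0.1, random_state=1)

# dbscan returns the indices of the core samples and one label per point.
core_samples, cluster_ids = dbscan(X, eps=0.2, min_samples=20)

is_noise = cluster_ids == -1
n_clusters = len(set(cluster_ids[~is_noise]))
sizes = np.bincount(cluster_ids[~is_noise])  # points per cluster id

print("clusters found:", n_clusters)
print("cluster sizes: ", sizes)
print("noise points:  ", int(is_noise.sum()))
print("core points:   ", len(core_samples))
```

Points that are in a cluster but not listed in `core_samples` are the border points.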
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
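# Generate random data
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)
# OPTICS does not need a fixed eps: it orders points by reachability distance
# and extracts clusters from that ordering (here with the xi method)
optics = OPTICS(min_samples=10, xi=0.05)
optics.fit(X)
# OPTICS exposes no eps_ attribute; the reachability distances in cluster
# order show which eps values would separate the clusters
print("Reachability (cluster order):", optics.reachability_[optics.ordering_])
print("Cluster labels:", optics.labels_)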
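One way to relate an OPTICS fit back to a DBSCAN-style `eps` is scikit-learn's `cluster_optics_dbscan`, which extracts DBSCAN-like labels from a fitted OPTICS model for a chosen `eps` without refitting. The sketch below is an illustration only; the candidate `eps` values are arbitrary assumptions, not recommendations.

```python
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=1000, centers=3, random_state=42)
optics = OPTICS(min_samples=10).fit(X)

# Try a few candidate eps values (arbitrary, for illustration) on one fit.
results = {}
for eps in (0.5, 1.0, 2.0):
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels) - {-1})
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

Comparing the cluster and noise counts across `eps` values gives a rough feel for how sensitive the result is to the radius.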
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_blobs
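# Generate random data
X, y = make_blobs(n_samples=1000, centers=3, random_state=42)
# Compute the k-distances
knn = NearestNeighbors(n_neighbors=10)
knn.fit(X)
distances, indices = knn.kneighbors(X)
# Estimate eps as the mean distance in the last column
# (column 0 is each point itself at distance 0, so column 9 is its 9th other neighbor)
eps = distances[:, 9].mean()
# Print the estimated eps
print("Estimated eps:", eps)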
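The k-distance estimate can then be fed back into DBSCAN. The sketch below also prints a few percentiles of the sorted k-distance curve, since a sharp rise near the top of that curve is the usual "elbow" people look for. Using the mean as `eps` and reusing `k` as `min_samples` are heuristics assumed here, not a rule.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, y = make_blobs(n_samples=1000, centers=3, random_state=42)

k = 10  # illustrative choice, reused as min_samples below
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)

# Sorted k-distance curve (last column = farthest of the k neighbors).
k_dist = np.sort(distances[:, -1])
for q in (50, 90, 95, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")

# Heuristic assumed here: mean k-distance as eps.
eps = k_dist.mean()
labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(f"eps={eps:.3f}: {n_clusters} clusters, {n_noise} noise points")
```

If too many points end up as noise, a higher percentile of the k-distance curve is a natural next value to try.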
A recommended interactive visualization:
https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
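Plot 1-D data
import matplotlib.pyplot as plt
import numpy as np

def show_scatter_1D(data, labels, noise_black=False):
    # data: 2-D array whose first column holds the values; labels: one cluster id per row
    data, labels = np.asarray(data), np.asarray(labels)
    plt.clf()
    if noise_black:
        # Draw noise points (label -1) as hollow red circles
        noise_data = data[labels == -1, 0]
        plt.scatter(noise_data, np.zeros_like(noise_data), edgecolor='red', c='white', label='Noise')
    for l in sorted(set(labels) - {-1}):
        # All points sit on y = 0; each cluster gets the next default color
        norm_data = data[labels == l, 0]
        plt.scatter(norm_data, np.zeros_like(norm_data), label=f'Cluster {l}')
    plt.legend()
    plt.show()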
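Plot 2-D data
import matplotlib.pyplot as plt
import numpy as np

def show_scatter_2D(data, labels, noise_black=False):
    # data: array with at least two columns (x, y); labels: one cluster id per row
    data, labels = np.asarray(data), np.asarray(labels)
    plt.clf()
    if noise_black:
        # Draw noise points (label -1) as hollow red circles
        plt.scatter(data[labels == -1, 0], data[labels == -1, 1], edgecolor='red', c='white', label='Noise')
    for l in sorted(set(labels) - {-1}):
        plt.scatter(data[labels == l, 0], data[labels == l, 1], label=f'Cluster {l}')
    plt.legend()
    plt.show()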