赞
踩
基于密度的噪声应用空间聚类(DBSCAN)是一种无监督聚类算法,它可以替代KMeans和层次聚类等流行的聚类算法。
针对这些缺点,人们提出了 DBSCAN 算法
首先该算法用到两个参数:
eps
:领域半径min_samples
:领域半径内的最少点数还有一些基本概念:
eps
的圆内点的数量,如果大于等于 min_samples
,那么该店就是核心点eps
,那么
P
P
P 到
Q
Q
Q 密度直达。注意,密度直达不具有对称性,也就是
P
P
P 到
Q
Q
Q 密度直达,
Q
Q
Q 到
P
P
P 不一定密度直达,因为
Q
Q
Q 可能不是核心点。看完上面的这些概念,对于这个算法的流程可能就已经有些眉目了,该算法的步骤可以大概分为两部分:
又是喜闻乐见的展示代码环节
class DBSCAN:
def __init__(self, eps, min_samples):
self.distance = None
self.visit = None
self.label = None
self.eps = eps
self.min_samples = min_samples
def fit(self, X):
"""
分类 id 从 1 开始,0 为噪声
"""
self._distance(X)
n_samples = X.shape[0]
self.visit = np.zeros(n_samples)
self.label = np.zeros(n_samples)
clusterId = 1
for i in range(n_samples):
if self.visit[i] == 1:
continue
self.visit[i] = 1
neighbors = self._region_query(i)
if len(neighbors) < self.min_samples:
self.label[i] = 0 # 标记为噪声点
else:
self.label[i] = clusterId
self._expand_cluster(neighbors, clusterId)
clusterId += 1
self.label = self.label.astype(int)
return self.label
def _distance(self, X):
self.distance = np.linalg.norm(X[:, np.newaxis] - X, axis=-1)
def _region_query(self, i):
dists = self.distance[i]
neighbors = np.argwhere(dists <= self.eps).squeeze(axis=1)
return neighbors
def _expand_cluster(self, neighbors, clusterId):
j = 0
while j < len(neighbors):
neighbor = neighbors[j]
if self.visit[neighbor] == 0:
self.visit[neighbor] = 1
self.label[neighbor] = clusterId
new_neighbors = self._region_query(neighbor)
if len(new_neighbors) >= self.min_samples:
neighbors = np.concatenate((neighbors, new_neighbors), axis=0)
j += 1
测试代码
import numpy as np
import matplotlib.pyplot as plt
from draw_point import map_label_to_color
def generate_dataset(num_clusters, num_points_per_cluster, noise_ratio):
# 确定数据集的范围
x_range = (-10, 10)
y_range = (-10, 10)
# 生成聚类中心点
cluster_centers = []
for _ in range(num_clusters):
center_x = np.random.uniform(*x_range)
center_y = np.random.uniform(*y_range)
cluster_centers.append((center_x, center_y))
# 生成数据点
dataset = []
for center in cluster_centers:
points = []
for _ in range(num_points_per_cluster):
offset_x = np.random.uniform(-1, 1)
offset_y = np.random.uniform(-1, 1)
point_x = center[0] + offset_x
point_y = center[1] + offset_y
points.append((point_x, point_y))
dataset.extend(points)
# 生成噪声点
num_noise_points = int(len(dataset) * noise_ratio)
noise_points = []
for _ in range(num_noise_points):
noise_x = np.random.uniform(*x_range)
noise_y = np.random.uniform(*y_range)
noise_points.append((noise_x, noise_y))
dataset.extend(noise_points)
return np.array(dataset)
colorLst = [
(1, 0, 0), # 红色
(0, 1, 0), # 绿色
(0, 0, 1), # 蓝色
(1, 1, 0), # 黄色
(1, 0, 1), # 品红色
(0, 1, 1), # 青色
(0.5, 0, 0), # 深红色
(0, 0.5, 0), # 深绿色
(0, 0, 0.5), # 深蓝色
(0.5, 0.5, 0.5), # 灰色
(0, 0, 0) # 白色
]
# 生成数据集
num_clusters = 4
num_points_per_cluster = 50
noise_ratio = 0.1
dataset = generate_dataset(num_clusters, num_points_per_cluster, noise_ratio)
eps = 1
min_samples = 4
dbscan = DBSCAN(eps, min_samples)
labels = dbscan.fit(dataset)
labels = map_label_to_color(labels)
plt.scatter(dataset[:, 0], dataset[:, 1], s=5, c=[colorLst[i] for i in labels])
plt.title('DBSCAN Test Dataset')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()
效果如下:
还记得 eps
和 min_samples
这两个参数吗,我们要想让算法有不错的效果,这个两个的值必须合适才可以。
min_samples
的取值:很简单,2 * 维度 即可eps
的取值:
eps
的值。这里我用的 KDTree 来快速找到第 k 近的点,代码如下,KDTree 的实现可以看我之前的博客:
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from model.kdTree import KDTree
from tools.visualization import draw_plots
csvPath = "./data/radar_3"
files = os.listdir(csvPath)
totDistLst = []
for i in tqdm(files):
df = pd.read_csv(os.path.join(csvPath, i))
points = np.array([df['u'].tolist(), df['v'].tolist(), df['z'].tolist()]).T
kdt = KDTree(points, 4)
for j in range(points.shape[0]):
points, dist = kdt.search_nearest(j, isInner=True)
totDist = -dist[0]
totDistLst.append(totDist)
totDistLst = sorted(totDistLst)
draw_plots("k-distance", totDistLst)
从下图可以看出,如果点的数量很多的话,拐点还是不太好确定的,可以通过分段的方式画出多个区间的图,再去具体的判断 eps
取值,比如在这个整体的图上看感觉拐点是在 250 的位置,但是如果把所有的点分成几部分分开来看的话,会发现取 50 左右才是最合适的值。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。