赞
踩
参考:西瓜书
DBSCAN的思想是基于密度来聚类,十分直观易懂,更严谨的描述可见西瓜书,其中个人认为最关键的是:
若 x x x为核心对象,由 x x x密度可达的所有样本组成的集合记为 X = { x ′ ∈ D ∣ x ′ 由 x 密 度 可 达 } X=\{x' \in D \mid x'由x密度可达\} X={x′∈D∣x′由x密度可达},则不难证明 X X X即为满足连接性与最大性的簇。
这就指明了实现的一种思路:先找到所有的核心对象,再找到这些核心对象密度可达的其他点。
伪代码如下:
这里给出C++的实现,基本上忠于上述的伪代码,没有对性能进行优化:
struct clusterData { int coordinates[2]; //coordinate[0]:x, coordinate[1]:y int clusterIndex = 0; int dataType = 0; //0:noise,1:boundary,2:core }; /********* dbscan_cpp Summary: Density-Based Spatial Clustering of Applications with Noise inplemented in C++ Parameters: cluster: an array contains all points, points' clusterIndex and dataType are initiated as 0 totalPts: number of points in cluster Return: number of clusters *********/ int dbscan_cpp(clusterData *cluster, const int totalPts, const double eps, const unsigned int minPts) { vector<int> coreObj; vector<set<int>> neighbors(totalPts); for (int j = 0; j < totalPts; j++) { for (int i = 0; i < totalPts; i++) { double dist = sqrt(pow((cluster[j].coordinates[0] - cluster[i].coordinates[0]), 2) + pow((cluster[j].coordinate[1] - cluster[i].coordinate[1]), 2)); if (dist <= eps) neighbors[j].insert(i); } if (neighbors[j].size() >= minPts) coreObj.push_back(j); } set<int> unvisitedPts; for (int i = 0; i < totalPts; i++) unvisitedPts.insert(i); int k = 1; //the index of first cluster is 1, not 0 vector<set<int>> C; while (coreObj.size() > 0) { set<int> unvisitedPtsOld(unvisitedPts.begin(), unvisitedPts.end()); int omg = coreObj[0]; list<int> Q; Q.push_back(omg); unvisitedPts.erase(omg); while (Q.size() > 0) { int q = Q.front(); Q.remove(q); cluster[q].clusterIndex = k; if (neighbors[q].size() >= minPts) { cluster[q].dataType = 2; set<int> delta; set_intersection(unvisitedPts.begin(), unvisitedPts.end(), neighbors[q].begin(), neighbors[q].end(), inserter(delta, delta.begin())); Q.insert(Q.end(), delta.begin(), delta.end()); set<int> diff; set_difference(unvisitedPts.begin(), unvisitedPts.end(), delta.begin(), delta.end(), inserter(diff, diff.begin())); unvisitedPts.clear(); copy(diff.begin(), diff.end(), std::inserter(unvisitedPts, unvisitedPts.end())); } else cluster[q].dataType = 1; } k = k + 1; set<int> c; set_difference(unvisitedPtsOld.begin(), unvisitedPtsOld.end(), unvisitedPts.begin(), unvisitedPts.end(), inserter(c, c.begin())); C.push_back(c); set<int> diff; sort(coreObj.begin(), coreObj.end()); set_difference(coreObj.begin(), coreObj.end(), c.begin(), c.end(), inserter(diff, diff.begin())); coreObj.assign(diff.begin(), diff.end()); } return k-1; }
这里还有一份MATLAB的实现可供参考。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。