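1. Definition: K-means clustering is a common unsupervised clustering algorithm. It iteratively searches for k clusters such that the sum of squared errors between each sample and the center of the cluster it belongs to is minimized. The iteration converges to a local minimum, which is not necessarily the global one.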
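The objective in the definition can be written out explicitly. The notation here is an assumption for illustration and does not come from the original post: $x_i$ is a sample, $C_j$ is the set of samples assigned to cluster $j$, and $\mu_j$ is the center of that cluster.

```latex
E = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

A smaller $E$ means the samples lie closer to their own cluster centers.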
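2. Steps:
1. Select k samples from the data set as the initial cluster centers;
2. Compute the distance from every sample to every cluster center;
3. Assign each sample to its nearest cluster center;
4. Recompute each cluster center as the mean of the samples assigned to it;
5. Repeat steps 2 to 4 until the cluster centers no longer change or the maximum number of iterations is reached.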
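To make the five steps concrete, here is a minimal worked sketch on four made-up one-dimensional samples. The toy data, the fixed (non-random) choice of initial centers, and the iteration cap of 100 are assumptions chosen so the run can be followed by hand; it also assumes MATLAB R2016b or later for implicit expansion in `x - center'`.

```matlab
x = [1; 2; 9; 10];        % four 1-D samples (toy data, assumed for illustration)
center = x(1:2);          % step 1: take the first two samples as initial centers
k = numel(center);
for ite = 1:100           % assumed iteration cap
    D = abs(x - center');                 % step 2: 4-by-2 matrix of distances
    [~, ind] = min(D, [], 2);             % step 3: nearest center for each sample
    new_center = accumarray(ind, x, [k 1], @mean);   % step 4: mean of each cluster
    if isequal(new_center, center)        % step 5: stop when the centers no longer move
        break;
    end
    center = new_center;
end
disp(center');            % final centers
disp(ind');               % final labels
```

By hand: the first pass gives centers 1 and 7, the second gives 1.5 and 9.5, and the third pass leaves them unchanged, so the loop stops with labels 1, 1, 2, 2.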
3. MATLAB implementation:
- clc;
- clear;
- close all;
- % Input data
- data = [0.697 0.460;0.774,0.376;0.634,0.264;0.608,0.318;0.556,0.215;0.403,0.237;
- 0.481,0.149;0.437,0.211;0.666,0.091;0.243,0.267;0.245,0.057;0.343,0.099;
- 0.639 0.161;0.657,0.198;0.360,0.370;0.593,0.042;0.719,0.103;0.359,0.188;
- 0.339,0.241;0.282,0.257;0.748,0.232;0.714,0.346;0.483,0.312;0.478,0.437;
- 0.525,0.369;0.751,0.489;0.532,0.472;0.473,0.376;0.725,0.445;0.446,0.459;];
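- [r,c] = size(data); % number of rows (samples) and columns (features)
- k = 10; % number of clusters
- %% step 1: randomly pick k samples from the data as the initial cluster centers
- center = data(randperm(r,k),:);
- %%
- ite = 0; % iteration counter
- dis = inf; % shift of the centers between iterations; inf so the loop runs at least once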
- while ite <= 1000 && dis >= 0.001
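- %% step 2: compute the distance from every sample to every cluster center
- D = pdist2(data,center); % D is an r-by-k matrix of Euclidean distances
- %% step 3: assign each sample to its nearest cluster center
- [~,ind] = min(D,[],2); % index of the nearest center for each sample
- new_data = [data,ind]; % append the label as an extra column
- sort_data = sortrows(new_data,c+1); % sort rows by label, ascending
- for i = 1:k
- len(i) = sum(ind == i); % number of samples assigned to center i
- end
- X = mat2cell(sort_data(:,1:c),len,c); % split sort_data into one cell per cluster
- %% step 4: recompute the cluster centers
- new_center = cell2mat(cellfun(@(x) mean(x,1), X, 'UniformOutput',false)); % per-dimension mean of the samples in each cluster
- dis = sqrt(mean(sum((new_center-center).^2,2))); % RMS shift of the centers; square element-wise before summing so shifts cannot cancel
- center = new_center;
- %% step 5: repeat steps 2 to 4 until the centers stop changing or the maximum number of iterations is reached
- ite = ite + 1;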
- end
- %% Plot
- colorArray = hsv(k); % k distinct colors, one per cluster
- colorLabel = colorArray(ind,:); % color matrix: samples in the same cluster share a color
- figure;
- plot(new_center(:,1),new_center(:,2),'kx','MarkerSize',10,'LineWidth',3) % plot the cluster centers
- hold on; % keep the current plot
- scatter(data(:,1),data(:,2),40,colorLabel,'filled'); % scatter plot of the original data, colored by colorLabel
Result: (figure not available)
4. Using MATLAB's built-in k-means function:
- clc;
- clear;
- close all;
- % Input data
- data = [0.697 0.460;0.774,0.376;0.634,0.264;0.608,0.318;0.556,0.215;0.403,0.237;
- 0.481,0.149;0.437,0.211;0.666,0.091;0.243,0.267;0.245,0.057;0.343,0.099;
- 0.639 0.161;0.657,0.198;0.360,0.370;0.593,0.042;0.719,0.103;0.359,0.188;
- 0.339,0.241;0.282,0.257;0.748,0.232;0.714,0.346;0.483,0.312;0.478,0.437;
- 0.525,0.369;0.751,0.489;0.532,0.472;0.473,0.376;0.725,0.445;0.446,0.459;];
- [r,c] = size(data);
- k = 10;
- opts = statset('Display','final');
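- [ind,C] = kmeans(data,k,'Distance','cityblock',...
- 'Replicates',5,'Options',opts); % kmeans with 'cityblock' (Manhattan) distance; 'Replicates' is how many times the clustering is repeated from new initial centers
- colorArray = hsv(k); % k distinct colors, one per cluster
- colorLabel = colorArray(ind,:); % color matrix: samples in the same cluster share a color
- figure;
- plot(C(:,1),C(:,2),'kx','MarkerSize',10,'LineWidth',3) % plot the cluster centers
- hold on; % keep the current plot
- scatter(data(:,1),data(:,2),40,colorLabel,'filled'); % scatter plot of the original data, colored by colorLabel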
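Simulation result: (figure not available)
The two implementations give different results. There are two likely reasons. First, the initial centroids (center points) differ: the hand-written script draws them at random, while `kmeans` picks its own starting points. Second, the distance metrics differ: the former uses Euclidean distance and the latter Manhattan distance. With 'cityblock', `kmeans` also takes each centroid as the component-wise median of its cluster rather than the mean.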
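One way to separate the two causes is to give both metrics the same starting centers, so any remaining difference comes from the distance measure alone. The sketch below is only an illustration under stated assumptions: the seed, `k = 3`, and the use of the first twelve samples from the listing are choices made here, not part of the original experiment. It needs the Statistics and Machine Learning Toolbox.

```matlab
% First twelve samples from the data set used above
data = [0.697 0.460; 0.774 0.376; 0.634 0.264; 0.608 0.318; 0.556 0.215; 0.403 0.237;
        0.481 0.149; 0.437 0.211; 0.666 0.091; 0.243 0.267; 0.245 0.057; 0.343 0.099];
rng(0);                                      % assumed seed, only for reproducibility
k = 3;                                       % assumed cluster count for this sketch
start = data(randperm(size(data,1),k),:);    % shared initial centers

% Same starting point, different metrics
[indE,CE] = kmeans(data,k,'Distance','sqeuclidean','Start',start);
[indM,CM] = kmeans(data,k,'Distance','cityblock','Start',start);

nDiff = sum(indE ~= indM);                   % samples whose label differs between the runs
fprintf('%d of %d samples are labelled differently\n', nDiff, size(data,1));
```

Because both runs start from the same rows of `start`, cluster j in one run grows out of the same seed as cluster j in the other, which is what makes the label comparison meaningful here.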
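Summary: K-means clustering has the following shortcomings:
1. The number of clusters must be chosen in advance;
2. It is sensitive to the initial choice of centroids;
3. It is sensitive to outliers.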
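The first two shortcomings are often softened in practice by running the clustering for several values of k, each with repeated restarts, and comparing the total within-cluster error (the so-called elbow heuristic). The sketch below is one possible way to do this, not part of the original post; `maxK`, the seed, the number of replicates, and the twelve-sample subset are assumptions. It needs the Statistics and Machine Learning Toolbox.

```matlab
% First twelve samples from the data set used above
data = [0.697 0.460; 0.774 0.376; 0.634 0.264; 0.608 0.318; 0.556 0.215; 0.403 0.237;
        0.481 0.149; 0.437 0.211; 0.666 0.091; 0.243 0.267; 0.245 0.057; 0.343 0.099];
rng(1);                      % assumed seed, only for reproducibility
maxK = 6;                    % assumed largest cluster count to try
sse = zeros(maxK,1);
for k = 1:maxK
    % 'Replicates' reruns k-means from new initial centers and keeps the best run,
    % which reduces the dependence on any single initialization
    [~,~,sumd] = kmeans(data,k,'Replicates',10);
    sse(k) = sum(sumd);      % total within-cluster sum of squared distances
end
figure;
plot(1:maxK, sse, '-o');
xlabel('k');
ylabel('total within-cluster SSE');
```

A value of k after which the curve flattens is a common candidate. This does not address the sensitivity to outliers.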