
Multivariate Statistical Analysis Course Paper: Evaluating Clustering Performance

Dataset source: Unsupervised Learning on Country Data (kaggle.com)

Code reference: Clustering: PCA | K-Means - DBSCAN - Hierarchical | Kaggle

Evaluating Clustering Performance on the Country Dataset under Feature-Synthesis and PCA Dimensionality Reduction

Contents

1. Feature-Synthesis Dimensionality Reduction

2. PCA Dimensionality Reduction

3. K-Means Clustering

3.1 Clustering the Feature-Synthesis Data

3.2 Clustering the PCA-Reduced Data

3.3 Silhouette-Score Evaluation

        Abstract: This paper evaluates how feature-synthesis and principal component analysis (PCA) dimensionality reduction affect K-Means clustering. The data come from a dataset of socioeconomic and health indicators for 168 countries provided by the humanitarian organization HELP International. The data were reduced with both feature synthesis and PCA, clustered with K-Means, and the clustering quality of the two reduced datasets was compared using the silhouette coefficient. The results show that the feature-synthesis dataset clusters better than the PCA-reduced dataset. Although the PCA reduction retained 95.8% of the original information, its clustering quality was worse, possibly because the projection loses the original structure of the data.

Variables in the dataset and their descriptions:

| Variable | Description |
|---|---|
| country | Country name |
| child_mort | Deaths of children under age 5 per 1,000 live births |
| exports | Exports of goods and services per capita, as a percentage of per-capita GDP |
| health | Total health spending per capita, as a percentage of per-capita GDP |
| imports | Imports of goods and services per capita, as a percentage of per-capita GDP |
| income | Net income per person |
| inflation | Annual growth rate of total GDP |
| life_expec | Average life expectancy of a newborn under current mortality patterns |
| total_fer | Number of children each woman would bear at current age-specific fertility rates |
| gdpp | GDP per capita: total GDP divided by total population |

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
pd.options.display.float_format = '{:.2f}'.format
import warnings
warnings.filterwarnings('ignore')
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from mpl_toolkits.mplot3d import Axes3D
import plotly.express as px
import kaleido
data = pd.read_csv(r'F:\Jupyter Files\Practice\kaggle-聚类\Country-data.csv')
plt.rcParams['font.family'] = ['sans-serif']
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
corr = data.corr(numeric_only=True)  # 'country' is non-numeric, so exclude it
ut = np.triu(corr)
lt = np.tril(corr)
colors = ['#FF781F', '#2D2926']
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.heatmap(corr, cmap=colors, annot=True, cbar=True, mask=ut)
plt.title('Correlation matrix: upper-triangular form')
plt.subplot(1, 2, 2)
sns.heatmap(corr, cmap=colors, annot=True, cbar=True, mask=lt)
plt.title('Correlation matrix: lower-triangular form')

1. Feature-Synthesis Dimensionality Reduction

Variable-merging rules:

| Category | Merged variables |
|---|---|
| Health | child_mort, health, life_expec, total_fer |
| Trade | exports, imports |
| Economic | income, inflation, gdpp |

df1 = pd.DataFrame()
# Mean-scale each variable, then sum within its category
# (columns: 健康类 = Health, 贸易类 = Trade, 经济类 = Economic)
df1['健康类'] = (data['child_mort'] / data['child_mort'].mean()) + (data['health'] / data['health'].mean()) + (data['life_expec'] / data['life_expec'].mean()) + (data['total_fer'] / data['total_fer'].mean())
df1['贸易类'] = (data['imports'] / data['imports'].mean()) + (data['exports'] / data['exports'].mean())
df1['经济类'] = (data['income'] / data['income'].mean()) + (data['inflation'] / data['inflation'].mean()) + (data['gdpp'] / data['gdpp'].mean())
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(5, 5))
plt.subplot(1, 1, 1)
sns.heatmap(df1.describe().T[['mean']], cmap='Oranges', annot=True, fmt='.2f', linecolor='black', linewidths=0.4, cbar=False)
plt.title('Mean Values')
fig.tight_layout(pad=4)

col = list(df1.columns)
numerical_features = [*col]
fig, ax = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))
for i in range(len(numerical_features)):
    plt.subplot(1, 3, i + 1)
    sns.distplot(df1[numerical_features[i]], color=colors[0])
    title = 'Variable: ' + numerical_features[i]
    plt.title(title)
plt.show()

# Min-max normalize the three synthesized features
from sklearn.preprocessing import MinMaxScaler, StandardScaler
mms = MinMaxScaler()  # normalization to [0, 1]
ss = StandardScaler()  # standardization (zero mean, unit variance), used later
df1['健康类'] = mms.fit_transform(df1[['健康类']])
df1['贸易类'] = mms.fit_transform(df1[['贸易类']])
df1['经济类'] = mms.fit_transform(df1[['经济类']])
df1.insert(loc=0, column='Country', value=list(data['country']))
df1.head()
| | Country | 健康类 | 贸易类 | 经济类 |
|---|---|---|---|---|
| 0 | Afghanistan | 0.63 | 0.14 | 0.08 |
| 1 | Albania | 0.13 | 0.20 | 0.09 |
| 2 | Algeria | 0.18 | 0.19 | 0.21 |
| 3 | Angola | 0.66 | 0.28 | 0.24 |
| 4 | Antigua and Barbuda | 0.12 | 0.28 | 0.15 |

2. PCA Dimensionality Reduction

col = list(data.columns)
col.remove('country')
categorical_features = ['country']
numerical_features = [*col]
print('Categorical Features :', *categorical_features)  # categorical variable
print('Numerical Features :', *numerical_features)  # numerical variables
Categorical Features : country
Numerical Features : child_mort exports health imports income inflation life_expec total_fer gdpp
fig, ax = plt.subplots(nrows=3, ncols=3, figsize=(15, 15))
for i in range(len(numerical_features)):
    plt.subplot(3, 3, i + 1)
    sns.distplot(data[numerical_features[i]], color=colors[0])
    plt.title(numerical_features[i])
plt.show()

# Standardize 'health'; min-max normalize the remaining variables
df2 = data.copy(deep=True)
col = list(data.columns)
col.remove('health'); col.remove('country')
df2['health'] = ss.fit_transform(df2[['health']])  # standardization
for i in col:
    df2[i] = mms.fit_transform(df2[[i]])  # normalization
df2.drop(columns='country', inplace=True)

Testing the processed data in SPSS (Table 3) gives a KMO value of 0.678 (> 0.5), which meets the threshold for principal component analysis, and a Bartlett's sphericity test significance level of 0.000 (< 0.05), indicating that the sample data are suitable for principal component analysis.
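These two diagnostics were computed in SPSS, but they can also be reproduced in Python. The sketch below is a minimal implementation from the textbook definitions, not the code used in this paper; it computes Bartlett's sphericity test and the overall KMO measure with NumPy and SciPy, demonstrated on synthetic single-factor data rather than the country dataset:

```python
import numpy as np
from scipy import stats

def bartlett_sphericity(X):
    """Bartlett's test of sphericity: H0 says the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    # chi2 = -(n - 1 - (2p + 5)/6) * ln|R|, with p(p-1)/2 degrees of freedom
    chi2 = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
    dof = p * (p - 1) / 2
    return chi2, stats.chi2.sf(chi2, dof)

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv_R = np.linalg.inv(R)
    # Partial correlations come from the scaled inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(inv_R), np.diag(inv_R)))
    P = -inv_R / d
    r2 = R ** 2 - np.eye(R.shape[1])  # squared correlations, diagonal removed
    p2 = P ** 2 - np.eye(P.shape[1])  # squared partial correlations, diagonal removed
    return r2.sum() / (r2.sum() + p2.sum())

# Synthetic data with one common factor: KMO should be high, Bartlett's p tiny
rng = np.random.default_rng(0)
common = rng.normal(size=(300, 1))
X = common + 0.6 * rng.normal(size=(300, 5))
chi2, pval = bartlett_sphericity(X)
k = kmo_overall(X)
```

If a ready-made version is preferred, the third-party factor_analyzer package offers similar calculations.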

from sklearn.decomposition import PCA
pca = PCA()
pca_df2 = pd.DataFrame(pca.fit_transform(df2))
pca.explained_variance_
array([1.01740511, 0.13090418, 0.03450018, 0.02679822, 0.00979752,
       0.00803398, 0.00307055, 0.00239976, 0.00179388])
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 5), dpi=80)
plt.step(list(range(1, 10)), np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained-variance ratio')
plt.show()
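The 95.8% figure quoted in the abstract corresponds to a cutoff on this cumulative curve. As a sketch of the same idea (run on synthetic correlated data, since the country data file is not bundled here), scikit-learn can pick the component count for a target ratio directly when PCA is given a float between 0 and 1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic, correlated 9-feature data standing in for the preprocessed df2
X = rng.normal(size=(200, 9)) @ rng.normal(size=(9, 9))

# A float n_components keeps the smallest number of components whose
# cumulative explained-variance ratio reaches that threshold
pca95 = PCA(n_components=0.95).fit(X)

# Equivalent manual selection from the cumulative ratio curve
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.95)) + 1  # first position where cum >= 0.95
```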

3. K-Means Clustering

m1 = df1.drop(columns=['Country']).values  # feature-synthesis data: Health / Trade / Economic
m2 = pca_df2.values  # PCA-reduced data
3.1 Clustering the Feature-Synthesis Data
sse = {}; sil = []; kmax = 10
fig = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
# Elbow method
plt.subplot(1, 2, 1)
for k in range(1, kmax + 1):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(m1)
    sse[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center
sns.lineplot(x=list(sse.keys()), y=list(sse.values()))
plt.title('Elbow Method')
plt.xlabel('k : number of clusters')
plt.ylabel('Sum of Squared Errors')
plt.grid()
# Silhouette-score method
plt.subplot(1, 2, 2)
for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k).fit(m1)
    labels = kmeans.labels_
    sil.append(silhouette_score(m1, labels, metric='euclidean'))
sns.lineplot(x=range(2, kmax + 1), y=sil)
plt.title('Silhouette Score Method')
plt.xlabel('k : number of clusters')
plt.ylabel('Silhouette Score')
plt.grid()
plt.show()

model = KMeans(n_clusters=3, max_iter=1000, algorithm='elkan')
model.fit(m1)
centroids = model.cluster_centers_
labels = model.labels_
data['Class'] = labels; df1['Class'] = labels
fig = plt.figure(dpi=100)
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer auto-attaches in recent matplotlib
x = np.array(df1['健康类'])
y = np.array(df1['贸易类'])
z = np.array(df1['经济类'])
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='X', color='b')
ax.scatter(x, y, z, c=labels)  # color each point by its cluster label
plt.title('Clustering of the Health / Trade / Economic features')
ax.set_xlabel('Health (健康类)')
ax.set_ylabel('Trade (贸易类)')
ax.set_zlabel('Economic (经济类)')
plt.show()

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x='Class', y='child_mort', data=data, color='#FF781F')
plt.title('child_mort vs Class')
plt.subplot(1, 2, 2)
sns.boxplot(x='Class', y='income', data=data, color='#FF781F')
plt.title('income vs Class')
plt.show()

df1['Class'] = df1['Class'].map({0: 'Might Need Help', 1: 'No Help Needed', 2: 'Help Needed'})
fig = px.choropleth(df1[['Country', 'Class']],
                    locationmode='country names',
                    locations='Country',
                    title='Needed Help Per Country (World)',
                    color=df1['Class'],
                    color_discrete_map={'Help Needed': 'Red',
                                        'No Help Needed': 'Green',
                                        'Might Need Help': 'Yellow'})
fig.update_geos(fitbounds='locations', visible=True)
fig.update_layout(legend_title_text='Labels', legend_title_side='top', title_pad_l=260, title_y=0.86)
fig.show(engine='kaleido')

3.2 Clustering the PCA-Reduced Data
sse = {}; sil = []; kmax = 10
fig = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))
# Elbow method
plt.subplot(1, 2, 1)
for k in range(1, kmax + 1):
    kmeans = KMeans(n_clusters=k, max_iter=1000).fit(m2)
    sse[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center
sns.lineplot(x=list(sse.keys()), y=list(sse.values()))
plt.title('Elbow Method')
plt.xlabel('k : number of clusters')
plt.ylabel('Sum of Squared Errors')
plt.grid()
# Silhouette-score method
plt.subplot(1, 2, 2)
for k in range(2, kmax + 1):
    kmeans = KMeans(n_clusters=k).fit(m2)
    labels = kmeans.labels_
    sil.append(silhouette_score(m2, labels, metric='euclidean'))
sns.lineplot(x=range(2, kmax + 1), y=sil)
plt.title('Silhouette Score Method')
plt.xlabel('k : number of clusters')
plt.ylabel('Silhouette Score')
plt.grid()
plt.show()

model = KMeans(n_clusters=3, max_iter=1000, algorithm='elkan')
model.fit(m2)
centroids = model.cluster_centers_
labels = model.labels_
data['Class'] = labels; pca_df2['Class'] = labels
fig = plt.figure(dpi=100)
ax = fig.add_subplot(projection='3d')
x, y, z = m2[:, 0], m2[:, 1], m2[:, 2]  # first three principal components
ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker='X', color='b')
ax.scatter(x, y, z, c=labels)  # color each point by its cluster label
plt.title('Clustering of the PCA-reduced data')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.boxplot(x='Class', y='child_mort', data=data, color='#FF781F')
plt.title('child_mort vs Class')
plt.subplot(1, 2, 2)
sns.boxplot(x='Class', y='income', data=data, color='#FF781F')
plt.title('income vs Class')
plt.show()

pca_df2['Class'] = pca_df2['Class'].map({0: 'Might Need Help', 1: 'No Help Needed', 2: 'Help Needed'})
# pca_df2 was built from the PCA scores only, so attach the country names first
pca_df2.insert(loc=0, column='Country', value=list(data['country']))
fig = px.choropleth(pca_df2[['Country', 'Class']],
                    locationmode='country names',
                    locations='Country',
                    title='Needed Help Per Country (World)',
                    color=pca_df2['Class'],
                    color_discrete_map={'Help Needed': 'Red',
                                        'Might Need Help': 'Yellow',
                                        'No Help Needed': 'Green'})
fig.update_geos(fitbounds='locations', visible=True)
fig.update_layout(legend_title_text='Labels', legend_title_side='top', title_pad_l=260, title_y=0.86)
fig.show(engine='kaleido')

3.3 Silhouette-Score Evaluation

        The silhouette coefficient is a metric for evaluating clustering quality. It is defined per sample and simultaneously measures a sample's similarity a to the other samples in its own cluster and its similarity b to the samples of other clusters: a is the mean distance between the sample and all other points in the same cluster, and b is the mean distance between the sample and all points in the nearest neighboring cluster. The silhouette coefficient of a single sample is computed as

s = (b - a) / max(a, b)

Clustering aims for "small differences within clusters, large differences between clusters", so the closer the silhouette coefficient is to 1, the more the sample resembles the members of its own cluster and differs from the samples of other clusters. If most samples in a cluster have high silhouette coefficients, the cluster as a whole scores high; the higher the mean silhouette coefficient over the entire dataset, the more appropriate the clustering.
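To make the definition concrete, the following sketch (illustrative only, not part of the paper's code) computes each sample's silhouette coefficient directly from a and b and checks the result against scikit-learn's silhouette_samples on synthetic blob data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples

def silhouette_one(X, labels, i):
    """Silhouette coefficient s = (b - a) / max(a, b) for sample i."""
    D = pairwise_distances(X)
    same = labels == labels[i]
    # a: mean distance to the other points in its own cluster (excluding itself)
    a = D[i, same].sum() / (same.sum() - 1)
    # b: mean distance to the points of the nearest other cluster
    b = min(D[i, labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

X, y = make_blobs(n_samples=60, centers=3, random_state=0)
ref = silhouette_samples(X, y)
manual = np.array([silhouette_one(X, y, i) for i in range(len(X))])
```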

# Feature-synthesis dataset
cluster_1 = KMeans(n_clusters=3, random_state=0).fit(m1)
silhouette_score(m1, cluster_1.labels_)  # 0.452

# PCA-reduced dataset
cluster_2 = KMeans(n_clusters=3, random_state=0).fit(m2)
silhouette_score(m2, cluster_2.labels_)  # 0.384
Silhouette coefficients of the two reduced datasets:

| | Feature-synthesis dataset | PCA-reduced dataset |
|---|---|---|
| Silhouette coefficient | 0.452 | 0.384 |

