
Python sklearn unsupervised data analysis test


The data comes without labels. The requirement is to cluster it, but since we don't know how many classes there should be, the only option is to try unsupervised methods.


Python's sklearn package offers many unsupervised clustering methods. Below is a quick test of several of them; the code is as follows:

from sklearn.cluster import KMeans, Birch, DBSCAN, MeanShift, estimate_bandwidth, SpectralClustering
from sklearn import metrics
from sklearn.metrics.pairwise import pairwise_distances
from scipy.spatial.distance import pdist, squareform
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random
from canopy import Canopy  # third-party Canopy module; imported but unused below

dat = pd.read_csv("./kuse.csv")
data = dat.T
# data = data.drop([''])
# print(len(data))

# Collect the transposed columns row by row until we run out.
x = []
count = 0
while True:
    try:
        x.append(data[count].values.tolist())
        count += 1
    except KeyError:  # no column named `count` left
        break
X = np.array(x)

print('\n')
print('#######################K-Means########################')
km = KMeans(n_clusters=5).fit(X)
# cluster labels
rs_labels = km.labels_
# center point of each cluster
rs_center_ids = km.cluster_centers_
print(rs_center_ids)
print(rs_labels)
km_y = KMeans(n_clusters=5).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, km_y))
print('###############################################')
print('\n')

print('#########################Birch######################')
bc = Birch(threshold=0.4, branching_factor=30, n_clusters=5, compute_labels=True, copy=True).fit(X)
# subcluster labels
rs_labels = bc.subcluster_labels_
# center point of each subcluster
rs_center_ids = bc.subcluster_centers_
all_labels = bc.labels_
print(rs_center_ids)
print(rs_labels)
bc_y = Birch(threshold=0.4, branching_factor=30, n_clusters=5, compute_labels=True, copy=True).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, bc_y))
print('###############################################')
print('\n')

print('######################DBSCAN#########################')
# algorithm can be 'auto', 'ball_tree', 'kd_tree' or 'brute'
dbscan = DBSCAN(eps=0.8, min_samples=5, metric='euclidean', metric_params=None,
                algorithm='ball_tree', leaf_size=30, p=None, n_jobs=1).fit(X)
all_labels = dbscan.labels_
print(all_labels)
dbscan_y = DBSCAN(eps=0.8, min_samples=5, metric='euclidean', metric_params=None,
                  algorithm='ball_tree', leaf_size=30, p=None, n_jobs=1).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, dbscan_y))
print('###############################################')
print('\n')

print('#######################MeanShift########################')
bandwidth = estimate_bandwidth(X, quantile=0.3, n_samples=None)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True, max_iter=500)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
print(cluster_centers)
ms_y = ms.fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, ms_y))
print('###############################################')
print('\n')

print('#######################SpectralClustering########################')
sc = SpectralClustering(n_clusters=5, assign_labels='discretize', random_state=0).fit(X)
all_labels = sc.labels_
print(all_labels)
sc_y = SpectralClustering(n_clusters=5, assign_labels='discretize', random_state=0).fit_predict(X)
print("Calinski-Harabasz Score", metrics.calinski_harabasz_score(X, sc_y))
print('###############################################')

# Parameter sweep. Note: gamma is printed but never passed to
# SpectralClustering, so the scores barely change across gamma values;
# pass gamma=gamma to the constructor to actually vary it.
scores = []
s = dict()
for gamma in (0.01, 0.1, 1, 5):
    for k in (3, 4, 5, 6, 7):
        pred_y = SpectralClustering(n_clusters=k).fit_predict(X)
        print("Calinski-Harabasz Score with gamma=", gamma, "n_cluster=", k, "score=",
              metrics.calinski_harabasz_score(X, pred_y))
        tmp = dict()
        tmp['gamma'] = gamma
        tmp['n_cluster'] = k
        tmp['score'] = metrics.calinski_harabasz_score(X, pred_y)
        s[metrics.calinski_harabasz_score(X, pred_y)] = tmp
        scores.append(metrics.calinski_harabasz_score(X, pred_y))
print(np.max(scores))
print("max score:")
print(s.get(np.max(scores)))
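One note on the listing above: MeanShift is the only method here that is not handed a cluster count; it infers one from the kernel bandwidth, so the quantile passed to estimate_bandwidth is its real tuning knob. A quick way to see that effect (the quantile values below are arbitrary choices of mine, reusing X and the imports from the listing):

from sklearn.cluster import MeanShift, estimate_bandwidth

for q in (0.1, 0.2, 0.3, 0.5):
    bw = estimate_bandwidth(X, quantile=q)          # larger quantile -> larger bandwidth
    n_found = MeanShift(bandwidth=bw, bin_seeding=True).fit(X).cluster_centers_.shape[0]
    print("quantile", q, "-> bandwidth", bw, "->", n_found, "clusters")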

Note: some readers will surely ask: if this is unsupervised, why does a method like KMeans still require the number of clusters, and how do you know what number to give at the start? For that, you can refer to the approach in this blog post:

sklearn Spectral Clustering (Part 1): running with COCO labels as an example (CSDN blog): https://blog.csdn.net/weixin_36474809/article/details/89855869 (official API: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html)

Iterate over candidate cluster counts and score each run with the evaluation metric metrics.calinski_harabasz_score; the cluster count with the highest score is the best one. Concretely, it looks like the parameter sweep at the end of the listing above, or the short sketch below:
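A minimal sketch of that search, reusing the feature matrix X built in the listing above; it sweeps candidate cluster counts with KMeans and keeps the best-scoring one:

from sklearn.cluster import KMeans
from sklearn import metrics

best_k, best_score = None, -1.0
for k in range(2, 11):                           # candidate cluster counts
    labels = KMeans(n_clusters=k).fit_predict(X)
    score = metrics.calinski_harabasz_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print("best k:", best_k, "score:", best_score)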

You only need to sweep the few most plausible parameter values; you will find a single peak, and every parameter setting other than the one at the peak scores clearly lower. You can check this against the results section I give at the end.

How to learn these:

The best way to learn an existing package is, of course, to go straight to the official documentation:

API Reference (scikit-learn 1.0.1 documentation): https://scikit-learn.org/stable/modules/classes.html#module-sklearn.cluster. Opening it shows all the methods the library provides.

The methods listed there under sklearn.cluster are all unsupervised clustering methods. Click into any one of them and you will find a detailed introduction; you will hardly need to look anywhere else to learn how to use it.

Next come the attributes the algorithm class exposes, some of which you are likely to need, for example the cluster-centers attribute (cluster_centers_) available after fitting the class; the practice code above shows concretely how the cluster centers are obtained.

Further down are the methods the class provides, together with usage examples. Every algorithm follows the same pattern, so once you can use one you can use them all; a short sketch of that common pattern follows, and the full listing above shows it in practice.
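A minimal sketch of the common estimator pattern, reusing X from the listing and using AgglomerativeClustering purely as a stand-in for "any other clusterer" (it is not part of the experiment above); the calls mirror the listing:

from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=5).fit(X)   # construct, then fit
print(model.labels_)                                   # per-sample cluster labels
labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)  # one-step variant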


Below are each algorithm's clustering results together with the official evaluation metric; the larger the Calinski-Harabasz Score, the better the clustering (a short sketch of how this score is computed follows the output).

#######################K-Means########################
[[0.10238661 0.04652876 0.64257175 0.56025887 0.04940799 0.44693049
  0.44693049 0.30257996 3.         0.30351089]
 [0.03910997 0.05882337 0.66510772 0.13391711 0.02917814 0.42322102
  0.42322102 0.30111248 1.         0.20353977]
 [0.07116973 0.11768102 0.29993584 0.14146421 0.06498572 0.29048064
  0.29048064 0.51616101 5.         0.20508759]
 [0.08476444 0.10926822 0.28438275 0.19259887 0.07765074 0.42020786
  0.42020786 0.27644649 2.         0.20605626]
 [0.03629227 0.06192236 0.6608674  0.1130219  0.03068587 0.27501765
  0.27501765 0.57433743 4.         0.20771638]]
[0 2 4 ... 2 2 2]
Calinski-Harabasz Score 93795.02483707225
###############################################
#########################Birch######################
[[0.10238661 0.04652876 0.64257175 0.56025887 0.04940799 0.44693049
  0.44693049 0.30257996 3.         0.30351089]
 [0.07116973 0.11768102 0.29993584 0.14146421 0.06498572 0.29048064
  0.29048064 0.51616101 5.         0.20508759]
 [0.03629227 0.06192236 0.6608674  0.1130219  0.03068587 0.27501765
  0.27501765 0.57433743 4.         0.20771638]
 [0.03910997 0.05882337 0.66510772 0.13391711 0.02917814 0.42322102
  0.42322102 0.30111248 1.         0.20353977]
 [0.08476444 0.10926822 0.28438275 0.19259887 0.07765074 0.42020786
  0.42020786 0.27644649 2.         0.20605626]]
[2 4 3 1 0]
Calinski-Harabasz Score 93795.02483707225
###############################################
######################DBSCAN#########################
[0 1 2 ... 1 1 1]
Calinski-Harabasz Score 75670.64330246476
###############################################
#######################MeanShift########################
[[0.05766018 0.09578076 0.43784817 0.13046477 0.0517502  0.28505013
  0.28505013 0.53773785 4.61677161 0.20477811]
 [0.08411639 0.10860097 0.28453402 0.19230271 0.07725466 0.42016314
  0.42016314 0.27657423 2.         0.20500095]
 [0.03889024 0.05820553 0.66499059 0.13398581 0.02920016 0.42325589
  0.42325589 0.301198   1.         0.20301746]
 [0.09761513 0.044748   0.64576806 0.55686585 0.04546649 0.44765525
  0.44765525 0.30046192 3.         0.29609165]
 [0.96625754 0.38837888 0.46391063 0.89552544 0.58666431 0.19630818
  0.19630818 0.69882809 3.         1.277805  ]]
Calinski-Harabasz Score 35545.65353257847
###############################################
#######################SpectralClustering########################
[3 4 1 ... 4 4 4]
Calinski-Harabasz Score 93795.02483707225
###############################################
Calinski-Harabasz Score with gamma= 0.01 n_cluster= 3 score= 45354.4481194343
Calinski-Harabasz Score with gamma= 0.01 n_cluster= 4 score= 58594.380870988265
Calinski-Harabasz Score with gamma= 0.01 n_cluster= 5 score= 93795.02483707225
Calinski-Harabasz Score with gamma= 0.01 n_cluster= 6 score= 64379.22511356569
Calinski-Harabasz Score with gamma= 0.01 n_cluster= 7 score= 54295.142816210384
Calinski-Harabasz Score with gamma= 0.1 n_cluster= 3 score= 45354.44811943432
Calinski-Harabasz Score with gamma= 0.1 n_cluster= 4 score= 58594.38087098827
Calinski-Harabasz Score with gamma= 0.1 n_cluster= 5 score= 93795.02483707225
Calinski-Harabasz Score with gamma= 0.1 n_cluster= 6 score= 64435.39142011564
Calinski-Harabasz Score with gamma= 0.1 n_cluster= 7 score= 54337.07322606031
Calinski-Harabasz Score with gamma= 1 n_cluster= 3 score= 45354.4481194343
Calinski-Harabasz Score with gamma= 1 n_cluster= 4 score= 58594.38087098827
Calinski-Harabasz Score with gamma= 1 n_cluster= 5 score= 93795.02483707225
Calinski-Harabasz Score with gamma= 1 n_cluster= 6 score= 64435.39142011564
Calinski-Harabasz Score with gamma= 1 n_cluster= 7 score= 54295.142816210384
Calinski-Harabasz Score with gamma= 5 n_cluster= 3 score= 45354.4481194343
Calinski-Harabasz Score with gamma= 5 n_cluster= 4 score= 58594.38087098827
Calinski-Harabasz Score with gamma= 5 n_cluster= 5 score= 93795.02483707225
Calinski-Harabasz Score with gamma= 5 n_cluster= 6 score= 64435.391420115644
Calinski-Harabasz Score with gamma= 5 n_cluster= 7 score= 54295.14281621037
93795.02483707225
max score:
{'gamma': 5, 'n_cluster': 5, 'score': 93795.02483707225}
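To make the metric concrete: the Calinski-Harabasz score is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom (k-1 and n-k), so compact, well-separated clusters score high. A minimal self-check sketch on synthetic toy data (the toy data and variable names here are illustrative, not from the experiment above):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))           # toy data: 100 points in 2-D
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

n, k = X_demo.shape[0], 3
overall_mean = X_demo.mean(axis=0)
B = W = 0.0
for c in range(k):
    members = X_demo[labels == c]
    center = members.mean(axis=0)
    B += len(members) * np.sum((center - overall_mean) ** 2)  # between-cluster dispersion
    W += np.sum((members - center) ** 2)                      # within-cluster dispersion
ch_manual = (B / (k - 1)) / (W / (n - k))
print(ch_manual, calinski_harabasz_score(X_demo, labels))     # the two values should match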

Result analysis: looking across these methods, the best number of clusters is 5, and the methods agree closely with one another; the cluster centers are nearly identical across methods. The only difference shows up in how individual data points are assigned: a point that KMeans puts in class 1 may end up in class 5 under MeanShift, since the class numbering is arbitrary. To actually verify which assignment is right, you would have to predict all the data and then judge, based on your own requirements, which method to use.
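One way to check how far two methods' assignments agree up to label renumbering (my suggestion, not something done in the original experiment) is the adjusted Rand index, which ignores how the clusters happen to be numbered; this reuses km_y, ms_y and sc_y from the listing above:

from sklearn.metrics import adjusted_rand_score

# 1.0 means the two partitions are identical up to renaming the labels;
# values near 0 mean roughly chance-level agreement.
print(adjusted_rand_score(km_y, ms_y))   # KMeans vs. MeanShift
print(adjusted_rand_score(km_y, sc_y))   # KMeans vs. SpectralClustering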
