
The KMeans Algorithm: Using the Elbow Method to Choose K, the Number of Cluster Centers

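K-Means is one of the most common clustering algorithms and is used in a wide range of clustering tasks. It is a prototype-based clustering algorithm.

A key step in the algorithm is choosing the value of K. We usually pick candidate values with the elbow method, then weigh the silhouette coefficient together with the number of samples in each cluster to decide which K is best.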

 

The elbow method

 

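1) For a dataset of n points, run the clustering for k from 1 to n. After each run, compute the sum of squared distances from every point to the center of the cluster it belongs to.

2) This sum of squares shrinks steadily as k grows and reaches 0 at k = n, because every point is then its own cluster center.

3) Along the way the curve shows an inflection point, the "elbow", where the rate of decrease suddenly slows. The k at that point is taken as the best value. The elbow criterion is equally useful for deciding when to stop training: real data is usually noisy, so stop adding clusters once more clusters no longer bring a meaningful return.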
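
The three steps above can be sketched in plain Scala without Spark. This is only an illustrative sketch, not the implementation Spark uses: the object name `ElbowSketch`, the toy data, the evenly spaced initial centers, and the `minGain` threshold (stop when one more cluster removes less than 5% of the k = 1 sum of squares) are all assumptions made for the example.

```scala
object ElbowSketch {
  type Point = Vector[Double]

  // Toy data (an assumption for illustration): three well-separated 1-D groups.
  val toy: Seq[Point] = Seq(1, 2, 3, 11, 12, 13, 21, 22, 23).map(x => Vector(x.toDouble))

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / ps.size)

  // Plain Lloyd iterations; initial centers are evenly spaced input points
  // (a deterministic simplification, not k-means++ or k-means||).
  def kMeans(points: Seq[Point], k: Int, iterations: Int = 20): Seq[Point] = {
    val n = points.size
    var centers: Seq[Point] = (0 until k).map(i => points(i * n / k))
    for (_ <- 1 to iterations) {
      // Assignment step: group every point under the index of its nearest center.
      val groups = points.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      // Update step: move each center to the mean of its group (keep it if the group is empty).
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }

  // Step 1: sum of squared distances from each point to its nearest center.
  def sse(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum

  // Step 3: first k at which adding one more cluster gains less than
  // minGain of the k = 1 cost. Expects costs sorted by k.
  def elbow(costs: Seq[(Int, Double)], minGain: Double = 0.05): Int = {
    val total = costs.head._2
    costs.zip(costs.tail)
      .collectFirst { case ((k, cur), (_, next)) if (cur - next) / total < minGain => k }
      .getOrElse(costs.last._1)
  }

  def main(args: Array[String]): Unit = {
    // Steps 1 and 2: the cost for k = 1..n; it reaches 0 at k = n.
    val costs = (1 to toy.size).map(k => k -> sse(toy, kMeans(toy, k)))
    costs.foreach { case (k, c) => println(f"k=$k%d  SSE=$c%.2f") }
    println(s"elbow at k = ${elbow(costs)}")
  }
}
```

On real data the threshold is a judgment call, and plotting the cost against k and looking for the bend is usually more reliable than any fixed rule.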

 

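The silhouette coefficient method

Another evaluation metric for clustering is the silhouette coefficient. It combines cohesion and separation to assess the quality of a clustering:

1) Compute a_i, the average distance from sample i to the other samples in its own cluster. The smaller a_i is, the lower the within-cluster dissimilarity of sample i and the more it belongs in that cluster.

2) Compute b_ij, the average distance from sample i to all samples of another cluster C_j, called the dissimilarity between sample i and cluster C_j. The between-cluster dissimilarity of sample i is b_i = min{b_i1, b_i2, ..., b_ik}. The larger b_i is, the less sample i belongs to any other cluster.

3) The silhouette coefficient of sample i is s_i = (b_i - a_i) / max(a_i, b_i). Averaging it over all samples gives the mean silhouette coefficient, which lies in [-1, 1]. The larger it is, the better the clustering: samples within a cluster are close together and samples in different clusters are far apart.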
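
As a rough illustration of these three steps, here is a small pure-Scala sketch of the mean silhouette coefficient. It is not how Spark's `ClusteringEvaluator` computes the value: the name `SilhouetteSketch` is hypothetical, the distance is plain Euclidean (an assumption), a sample alone in its cluster is scored 0 by convention, and at least two clusters are required.

```scala
object SilhouetteSketch {
  type Point = Vector[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Mean silhouette coefficient; labels(i) is the cluster id of points(i). */
  def meanSilhouette(points: Seq[Point], labels: Seq[Int]): Double = {
    val indexed = points.zip(labels).zipWithIndex // ((point, label), index)
    val scores = indexed.map { case ((p, label), i) =>
      val others = indexed.filter { case (_, j) => j != i }
      // Distances to the other members of the same cluster.
      val same = others.collect { case ((q, l), _) if l == label => dist(p, q) }
      if (same.isEmpty) 0.0 // convention for a cluster with a single sample
      else {
        // Step 1: a_i, mean distance to the rest of the own cluster.
        val a = same.sum / same.size
        // Step 2: b_i, smallest mean distance to any other cluster.
        val b = others
          .filter { case ((_, l), _) => l != label }
          .groupBy { case ((_, l), _) => l }
          .values
          .map(g => g.map { case ((q, _), _) => dist(p, q) }.sum / g.size)
          .min
        val denom = math.max(a, b)
        if (denom == 0.0) 0.0 else (b - a) / denom
      }
    }
    // Step 3: average over all samples.
    scores.sum / scores.size
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(0.0, 1.0, 10.0, 11.0).map(x => Vector(x))
    println(meanSilhouette(pts, Seq(0, 0, 1, 1))) // tight, well-separated clusters
    println(meanSilhouette(pts, Seq(0, 1, 0, 1))) // clusters that cut across the groups
  }
}
```

The first call scores close to 1 and the second is negative, which matches the reading of the coefficient given above.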

 

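The example below clusters the iris dataset with the KMeans algorithm, uses the elbow method to obtain K, and evaluates the model with the silhouette coefficient.

Preparation:

A: Data preparation: iris_kmeans.txt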

    1 1:5.1 2:3.5 3:1.4 4:0.2
    1 1:4.9 2:3.0 3:1.4 4:0.2
    1 1:4.7 2:3.2 3:1.3 4:0.2
    1 1:4.6 2:3.1 3:1.5 4:0.2
    1 1:5.0 2:3.6 3:1.4 4:0.2
    1 1:5.4 2:3.9 3:1.7 4:0.4
    1 1:4.6 2:3.4 3:1.4 4:0.3
    1 1:5.0 2:3.4 3:1.5 4:0.2
    1 1:4.4 2:2.9 3:1.4 4:0.2
    1 1:4.9 2:3.1 3:1.5 4:0.1
    1 1:5.4 2:3.7 3:1.5 4:0.2
    1 1:4.8 2:3.4 3:1.6 4:0.2
    1 1:4.8 2:3.0 3:1.4 4:0.1
    1 1:4.3 2:3.0 3:1.1 4:0.1
    1 1:5.8 2:4.0 3:1.2 4:0.2
    1 1:5.7 2:4.4 3:1.5 4:0.4
    1 1:5.4 2:3.9 3:1.3 4:0.4
    1 1:5.1 2:3.5 3:1.4 4:0.3
    1 1:5.7 2:3.8 3:1.7 4:0.3
    1 1:5.1 2:3.8 3:1.5 4:0.3
    1 1:5.4 2:3.4 3:1.7 4:0.2
    1 1:5.1 2:3.7 3:1.5 4:0.4
    1 1:4.6 2:3.6 3:1.0 4:0.2
    1 1:5.1 2:3.3 3:1.7 4:0.5
    1 1:4.8 2:3.4 3:1.9 4:0.2
    1 1:5.0 2:3.0 3:1.6 4:0.2
    1 1:5.0 2:3.4 3:1.6 4:0.4
    1 1:5.2 2:3.5 3:1.5 4:0.2
    1 1:5.2 2:3.4 3:1.4 4:0.2
    1 1:4.7 2:3.2 3:1.6 4:0.2
    1 1:4.8 2:3.1 3:1.6 4:0.2
    1 1:5.4 2:3.4 3:1.5 4:0.4
    1 1:5.2 2:4.1 3:1.5 4:0.1
    1 1:5.5 2:4.2 3:1.4 4:0.2
    1 1:4.9 2:3.1 3:1.5 4:0.1
    1 1:5.0 2:3.2 3:1.2 4:0.2
    1 1:5.5 2:3.5 3:1.3 4:0.2
    1 1:4.9 2:3.1 3:1.5 4:0.1
    1 1:4.4 2:3.0 3:1.3 4:0.2
    1 1:5.1 2:3.4 3:1.5 4:0.2
    1 1:5.0 2:3.5 3:1.3 4:0.3
    1 1:4.5 2:2.3 3:1.3 4:0.3
    1 1:4.4 2:3.2 3:1.3 4:0.2
    1 1:5.0 2:3.5 3:1.6 4:0.6
    1 1:5.1 2:3.8 3:1.9 4:0.4
    1 1:4.8 2:3.0 3:1.4 4:0.3
    1 1:5.1 2:3.8 3:1.6 4:0.2
    1 1:4.6 2:3.2 3:1.4 4:0.2
    1 1:5.3 2:3.7 3:1.5 4:0.2
    1 1:5.0 2:3.3 3:1.4 4:0.2
    2 1:7.0 2:3.2 3:4.7 4:1.4
    2 1:6.4 2:3.2 3:4.5 4:1.5
    2 1:6.9 2:3.1 3:4.9 4:1.5
    2 1:5.5 2:2.3 3:4.0 4:1.3
    2 1:6.5 2:2.8 3:4.6 4:1.5
    2 1:5.7 2:2.8 3:4.5 4:1.3
    2 1:6.3 2:3.3 3:4.7 4:1.6
    2 1:4.9 2:2.4 3:3.3 4:1.0
    2 1:6.6 2:2.9 3:4.6 4:1.3
    2 1:5.2 2:2.7 3:3.9 4:1.4
    2 1:5.0 2:2.0 3:3.5 4:1.0
    2 1:5.9 2:3.0 3:4.2 4:1.5
    2 1:6.0 2:2.2 3:4.0 4:1.0
    2 1:6.1 2:2.9 3:4.7 4:1.4
    2 1:5.6 2:2.9 3:3.6 4:1.3
    2 1:6.7 2:3.1 3:4.4 4:1.4
    2 1:5.6 2:3.0 3:4.5 4:1.5
    2 1:5.8 2:2.7 3:4.1 4:1.0
    2 1:6.2 2:2.2 3:4.5 4:1.5
    2 1:5.6 2:2.5 3:3.9 4:1.1
    2 1:5.9 2:3.2 3:4.8 4:1.8
    2 1:6.1 2:2.8 3:4.0 4:1.3
    2 1:6.3 2:2.5 3:4.9 4:1.5
    2 1:6.1 2:2.8 3:4.7 4:1.2
    2 1:6.4 2:2.9 3:4.3 4:1.3
    2 1:6.6 2:3.0 3:4.4 4:1.4
    2 1:6.8 2:2.8 3:4.8 4:1.4
    2 1:6.7 2:3.0 3:5.0 4:1.7
    2 1:6.0 2:2.9 3:4.5 4:1.5
    2 1:5.7 2:2.6 3:3.5 4:1.0
    2 1:5.5 2:2.4 3:3.8 4:1.1
    2 1:5.5 2:2.4 3:3.7 4:1.0
    2 1:5.8 2:2.7 3:3.9 4:1.2
    2 1:6.0 2:2.7 3:5.1 4:1.6
    2 1:5.4 2:3.0 3:4.5 4:1.5
    2 1:6.0 2:3.4 3:4.5 4:1.6
    2 1:6.7 2:3.1 3:4.7 4:1.5
    2 1:6.3 2:2.3 3:4.4 4:1.3
    2 1:5.6 2:3.0 3:4.1 4:1.3
    2 1:5.5 2:2.5 3:4.0 4:1.3
    2 1:5.5 2:2.6 3:4.4 4:1.2
    2 1:6.1 2:3.0 3:4.6 4:1.4
    2 1:5.8 2:2.6 3:4.0 4:1.2
    2 1:5.0 2:2.3 3:3.3 4:1.0
    2 1:5.6 2:2.7 3:4.2 4:1.3
    2 1:5.7 2:3.0 3:4.2 4:1.2
    2 1:5.7 2:2.9 3:4.2 4:1.3
    2 1:6.2 2:2.9 3:4.3 4:1.3
    2 1:5.1 2:2.5 3:3.0 4:1.1
    2 1:5.7 2:2.8 3:4.1 4:1.3
    3 1:6.3 2:3.3 3:6.0 4:2.5
    3 1:5.8 2:2.7 3:5.1 4:1.9
    3 1:7.1 2:3.0 3:5.9 4:2.1
    3 1:6.3 2:2.9 3:5.6 4:1.8
    3 1:6.5 2:3.0 3:5.8 4:2.2
    3 1:7.6 2:3.0 3:6.6 4:2.1
    3 1:4.9 2:2.5 3:4.5 4:1.7
    3 1:7.3 2:2.9 3:6.3 4:1.8
    3 1:6.7 2:2.5 3:5.8 4:1.8
    3 1:7.2 2:3.6 3:6.1 4:2.5
    3 1:6.5 2:3.2 3:5.1 4:2.0
    3 1:6.4 2:2.7 3:5.3 4:1.9
    3 1:6.8 2:3.0 3:5.5 4:2.1
    3 1:5.7 2:2.5 3:5.0 4:2.0
    3 1:5.8 2:2.8 3:5.1 4:2.4
    3 1:6.4 2:3.2 3:5.3 4:2.3
    3 1:6.5 2:3.0 3:5.5 4:1.8
    3 1:7.7 2:3.8 3:6.7 4:2.2
    3 1:7.7 2:2.6 3:6.9 4:2.3
    3 1:6.0 2:2.2 3:5.0 4:1.5
    3 1:6.9 2:3.2 3:5.7 4:2.3
    3 1:5.6 2:2.8 3:4.9 4:2.0
    3 1:7.7 2:2.8 3:6.7 4:2.0
    3 1:6.3 2:2.7 3:4.9 4:1.8
    3 1:6.7 2:3.3 3:5.7 4:2.1
    3 1:7.2 2:3.2 3:6.0 4:1.8
    3 1:6.2 2:2.8 3:4.8 4:1.8
    3 1:6.1 2:3.0 3:4.9 4:1.8
    3 1:6.4 2:2.8 3:5.6 4:2.1
    3 1:7.2 2:3.0 3:5.8 4:1.6
    3 1:7.4 2:2.8 3:6.1 4:1.9
    3 1:7.9 2:3.8 3:6.4 4:2.0
    3 1:6.4 2:2.8 3:5.6 4:2.2
    3 1:6.3 2:2.8 3:5.1 4:1.5
    3 1:6.1 2:2.6 3:5.6 4:1.4
    3 1:7.7 2:3.0 3:6.1 4:2.3
    3 1:6.3 2:3.4 3:5.6 4:2.4
    3 1:6.4 2:3.1 3:5.5 4:1.8
    3 1:6.0 2:3.0 3:4.8 4:1.8
    3 1:6.9 2:3.1 3:5.4 4:2.1
    3 1:6.7 2:3.1 3:5.6 4:2.4
    3 1:6.9 2:3.1 3:5.1 4:2.3
    3 1:5.8 2:2.7 3:5.1 4:1.9
    3 1:6.8 2:3.2 3:5.9 4:2.3
    3 1:6.7 2:3.3 3:5.7 4:2.5
    3 1:6.7 2:3.0 3:5.2 4:2.3
    3 1:6.3 2:2.5 3:5.0 4:1.9
    3 1:6.5 2:3.0 3:5.2 4:2.0
    3 1:6.2 2:3.4 3:5.4 4:2.3
    3 1:5.9 2:3.0 3:5.1 4:1.8
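
Each record above is in the LIBSVM text layout that the program later reads with `format("libsvm")`: a label, followed by `index:value` pairs with 1-based feature indices. To make the layout concrete, here is a small pure-Scala sketch that parses one such record. The object `LibsvmLineSketch` and its `parse` method are hypothetical helpers for illustration only; Spark's own reader is what the program actually uses.

```scala
object LibsvmLineSketch {
  /** Parses "label index:value ..." into (label, dense features); indices are 1-based. */
  def parse(line: String, numFeatures: Int): (Double, Array[Double]) = {
    val tokens = line.trim.split("\\s+")
    // Features that do not appear in the record stay at 0.0.
    val features = Array.fill(numFeatures)(0.0)
    tokens.tail.foreach { token =>
      val Array(index, value) = token.split(":")
      features(index.toInt - 1) = value.toDouble
    }
    (tokens.head.toDouble, features)
  }

  def main(args: Array[String]): Unit = {
    val (label, features) = parse("1 1:5.1 2:3.5 3:1.4 4:0.2", 4)
    println(s"label=$label features=${features.mkString("[", ", ", "]")}")
  }
}
```

The label column is not used for clustering itself; it only makes it possible to compare the clusters with the three known species afterwards.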

B: Maven dependencies (these are the dependencies of my whole project; I was too lazy to trim them, so use what you need)

    <repositories>
        <repository>
            <id>ali-repo</id>
            <name>ali-repo</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <layout>default</layout>
        </repository>
        <repository>
            <id>mvn-repo</id>
            <name>mvn-repo</name>
            <url>https://mvnrepository.com</url>
        </repository>
        <repository>
            <id>cdh-repo</id>
            <name>cdh-repo</name>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>hdp-repo</id>
            <name>hdp-repo</name>
            <url>http://repo.hortonworks.com/content/repositories/releases/</url>
        </repository>
    </repositories>
    <properties>
        <java.version>1.8</java.version>
        <!-- project compiler -->
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>
        <scala.version>2.11.8</scala.version>
        <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
        <!-- KMeans.setDistanceMeasure and ClusteringEvaluator.setDistanceMeasure, used in the code in section C, need Spark 2.4.0 or later -->
        <spark.version>2.4.0</spark.version>
        <hive.version>1.1.0-cdh5.14.0</hive.version>
        <oozie.version>4.1.0-cdh5.14.0</oozie.version>
        <hbase.version>1.2.0-cdh5.14.0</hbase.version>
        <solr.version>4.10.3-cdh5.14.0</solr.version>
        <jsch.version>0.1.53</jsch.version>
        <jackson.spark.version>2.6.5</jackson.spark.version>
        <mysql.version>5.1.46</mysql.version>
        <!-- maven plugins -->
        <mybatis-generator-maven-plugin.version>1.3.5</mybatis-generator-maven-plugin.version>
        <maven-surefire-plugin.version>2.19.1</maven-surefire-plugin.version>
        <maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
        <wagon-ssh.version>3.1.0</wagon-ssh.version>
        <wagon-maven-plugin.version>2.0.0</wagon-maven-plugin.version>
        <maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
        <maven-war-plugin.version>3.2.1</maven-war-plugin.version>
        <jetty-maven-plugin.version>9.4.10.v20180503</jetty-maven-plugin.version>
    </properties>
    <dependencies>
        <!-- Scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- jackson -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-annotations</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-core</artifactId>
            <version>${jackson.spark.version}</version>
        </dependency>
        <!-- spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.scalanlp</groupId>
                    <artifactId>breeze_2.11</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.scalanlp</groupId>
            <artifactId>breeze_2.11</artifactId>
            <version>0.13</version>
            <exclusions>
                <exclusion>
                    <groupId>org.scala-lang</groupId>
                    <artifactId>scala-library</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- hadoop -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-core-asl</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-mapper-asl</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-sslengine</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.codehaus.jackson</groupId>
                    <artifactId>jackson-xc</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <!-- hbase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>${hbase.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>servlet-api-2.5</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util-6.1.26.hwx</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-util</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>org.mortbay.jetty</groupId>
                    <artifactId>jetty-sslengine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- solr -->
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
            <version>${solr.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-solrj</artifactId>
            <version>${solr.version}</version>
        </dependency>
        <!-- mysql -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.2.1</version>
        </dependency>
    </dependencies>
    <build>
        <outputDirectory>target/classes</outputDirectory>
        <testOutputDirectory>target/test-classes</testOutputDirectory>
        <resources>
            <resource>
                <directory>${project.basedir}/src/main/resources</directory>
            </resource>
        </resources>
        <!-- Maven compiler plugins -->
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

C: The code:

    package ml

    import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
    import org.apache.spark.ml.evaluation.ClusteringEvaluator
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    import scala.collection.immutable

    /**
     * @Author: sou1yu
     * @Email:sou1yu@aliyun.com
     */
    object IrisClusterDemo {
      def main(args: Array[String]): Unit = {
        val spark: SparkSession = SparkSession.builder()
          .appName(this.getClass.getSimpleName.stripSuffix("$"))
          .master("local[3]")
          .config("spark.sql.shuffle.partitions", "2")
          .getOrCreate()
        import spark.implicits._

        // 1. Read the iris dataset
        val irisDF: DataFrame = spark.read.format("libsvm")
          .option("numFeatures", "4")
          .load("datas/iris_kmeans.txt")
        /**
         * irisDF.printSchema()
         * irisDF.show(10,false)
         * root
         * |-- label: double (nullable = true)
         * |-- features: vector (nullable = true)
         *
         * +-----+-------------------------------+
         * |label|features |
         * +-----+-------------------------------+
         * |1.0 |(4,[0,1,2,3],[5.1,3.5,1.4,0.2])|
         * |1.0 |(4,[0,1,2,3],[4.9,3.0,1.4,0.2])|
         * |1.0 |(4,[0,1,2,3],[4.7,3.2,1.3,0.2])|
         * |1.0 |(4,[0,1,2,3],[4.6,3.1,1.5,0.2])|
         * |1.0 |(4,[0,1,2,3],[5.0,3.6,1.4,0.2])|
         */

        // 2. Try K from 2 to 6 and score each model in order to choose K
        val values: immutable.IndexedSeq[(Int, KMeansModel, String, Double)] = (2 to 6).map {
          k =>
            // a. Create and configure the KMeans estimator
            val kMeans = new KMeans()
              // Names of the input feature column and the output prediction column
              .setFeaturesCol("features")
              .setPredictionCol("prediction")
              // Set K dynamically
              .setK(k)
              // Maximum number of iterations
              .setMaxIter(50)
              // Initialization mode; optional, because "k-means||" is already the default.
              // It is a parallel variant of k-means++ that oversamples candidate centers
              // and then picks K of them as the initial cluster centers.
              .setInitMode("k-means||")
              // Distance measure: Euclidean (the default) or cosine. This is how the
              // distance from a data point to a cluster center is measured.
              // .setDistanceMeasure("euclidean")
              .setDistanceMeasure("cosine")

            // b. Fit the estimator on the dataset to obtain the model (a transformer)
            val kmeansModel: KMeansModel = kMeans.fit(irisDF)

            // c. Predict the cluster of every sample
            val predictionDF: DataFrame = kmeansModel.transform(irisDF)

            // Count the number of samples in each cluster
            val clusterNumber: String = predictionDF.groupBy($"prediction").count()
              .select($"prediction", $"count")
              .as[(Int, Long)]
              .rdd
              .collectAsMap()
              .toMap
              .mkString(",")

            // d. Evaluate the model
            val evaluator: ClusteringEvaluator = new ClusteringEvaluator()
              .setPredictionCol("prediction")
              // Use the silhouette coefficient as the metric
              .setMetricName("silhouette")
              // Evaluate with squared Euclidean distance (the API default) or cosine distance
              //.setDistanceMeasure("squaredEuclidean")
              .setDistanceMeasure("cosine")
            /* The silhouette coefficient combines cohesion (how tightly the samples of a
               cluster gather together) and separation (how far apart the clusters are).
               The closer it is to 1 the better (the mean silhouette coefficient lies in [-1, 1]).
               The cluster sizes should also be as balanced as possible.
             */
            val scValue: Double = evaluator.evaluate(predictionDF)

            // e. Return a 4-tuple
            (k, kmeansModel, clusterNumber, scValue)
        }

        // Print the metrics for every K
        values.foreach(println)

        // The application is done; release resources
        spark.stop()
      }
    }

D: Results with Euclidean distance and with cosine distance

 

Euclidean distance

K, model, number of samples in each cluster, silhouette coefficient
(2,kmeans_33af8f322a80,1 -> 97,0 -> 53,0.8501515983265806)
(3,kmeans_dddad8bd3858,2 -> 39,1 -> 50,0 -> 61,0.7342113066202725)
(4,kmeans_251d99eaeae4,2 -> 28,1 -> 50,3 -> 43,0 -> 29,0.6748661728223084)
(5,kmeans_5a9a066aaa9a,0 -> 23,1 -> 33,2 -> 30,3 -> 47,4 -> 17,0.5593200358940349)
(6,kmeans_734c87051c61,0 -> 30,5 -> 18,1 -> 19,2 -> 47,3 -> 23,4 -> 13,0.5157126401818913)

Cosine distance

K, model, number of samples in each cluster, silhouette coefficient
(2,kmeans_99c4cabaa950,1 -> 50,0 -> 100,0.9579554849242657)
(3,kmeans_73251a945156,2 -> 46,1 -> 50,0 -> 54,0.7484647230660575)
(4,kmeans_5f8bce0297d5,2 -> 46,1 -> 19,3 -> 31,0 -> 54,0.5754341193280768)
(5,kmeans_92f07728d30f,0 -> 27,1 -> 50,2 -> 23,3 -> 28,4 -> 22,0.6430770644178772)
(6,kmeans_acbd159f5a1e,0 -> 24,5 -> 21,1 -> 29,2 -> 43,3 -> 15,4 -> 18,0.4512255960897416)

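Takeaway: judged by the silhouette coefficient alone, K = 2 scores highest under both distance measures. Once the cluster sizes are taken into account as well, K = 3 is the better choice: it still scores above 0.73 in both runs, and it splits the 150 samples into three clusters of similar size (46/50/54 with cosine distance), which matches the three labels of 50 samples each in the data file.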

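Cosine distance uses the cosine of the angle between two vectors to measure how different two individuals are. Compared with Euclidean distance, it focuses on the difference in direction between the two vectors. Pictured in a three-dimensional coordinate system, Euclidean distance is the absolute distance between the two points, whereas cosine distance depends only on the angle between the two vectors and not on their lengths.

Summary: keep the two apart in everyday use. Cosine distance is not a distance metric in the strict sense (it does not satisfy the triangle inequality), but it is still very useful for describing the relationship between two feature vectors, for example in face recognition and recommender systems.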
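
To see the difference in numbers, here is a minimal pure-Scala sketch of the two measures. The object `DistanceSketch` and the sample vectors are assumptions made for illustration, and cosine distance is taken here as 1 minus the cosine similarity.

```scala
object DistanceSketch {
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // 1 - cos(angle between a and b); both vectors must be non-zero.
  def cosineDistance(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: Seq[Double]) => math.sqrt(v.map(x => x * x).sum)
    1.0 - dot / (norm(a) * norm(b))
  }

  def main(args: Array[String]): Unit = {
    // Same direction, very different lengths.
    val a = Seq(1.0, 1.0, 1.0)
    val b = Seq(10.0, 10.0, 10.0)
    println(s"euclidean=${euclidean(a, b)} cosine=${cosineDistance(a, b)}")

    // Same length, perpendicular directions.
    val c = Seq(1.0, 0.0, 0.0)
    val d = Seq(0.0, 1.0, 0.0)
    println(s"euclidean=${euclidean(c, d)} cosine=${cosineDistance(c, d)}")
  }
}
```

The first pair is far apart in Euclidean terms but has a cosine distance of essentially zero. The second pair is close in Euclidean terms but has a cosine distance of 1.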
