K-Means is one of the most commonly used clustering algorithms; it belongs to the family of prototype-based clustering methods.
The crucial step of the algorithm is choosing the number of clusters K. A common approach is to pick candidate values with the elbow method, then use the silhouette coefficient together with the per-cluster point counts to judge which K works best.
1) For a dataset of n points, run the clustering for k from 1 to n; after each run, compute the sum of squared distances from every point to the center of its assigned cluster (the within-cluster SSE).
2) This sum of squares decreases monotonically, reaching 0 at k = n, where every point is its own cluster center.
3) Along the way the curve shows a bend, the "elbow" point: the k at which the rate of decrease suddenly flattens is taken as the best value. The same elbow criterion also helps decide when to stop: real data is noisy, so stop adding clusters once extra clusters no longer bring a meaningful reduction.
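The elbow idea can be sketched in a few lines of plain Scala. This is a minimal illustration on a hypothetical toy dataset (two tight clusters, not the iris data), with hand-picked centers rather than a full K-Means fit; it only shows how the within-cluster sum of squares (WSS) flattens past the true cluster count:

```scala
// Hypothetical toy data (not the iris set): two tight clusters of four points each.
object ElbowSketch {
  type P = (Double, Double)

  val data: Seq[P] = Seq(
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),        // cluster around (0.5, 0.5)
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0) // cluster around (10.5, 10.5)
  )

  def dist2(a: P, b: P): Double = {
    val (dx, dy) = (a._1 - b._1, a._2 - b._2)
    dx * dx + dy * dy
  }

  // Within-cluster sum of squares: each point contributes its squared
  // distance to the nearest of the given centers.
  def wss(centers: Seq[P]): Double =
    data.map(p => centers.map(c => dist2(p, c)).min).sum

  def main(args: Array[String]): Unit = {
    val k1 = wss(Seq((5.5, 5.5)))                             // k=1: the global mean
    val k2 = wss(Seq((0.5, 0.5), (10.5, 10.5)))               // k=2: one center per cluster
    val k3 = wss(Seq((0.5, 0.5), (10.0, 10.5), (11.0, 10.5))) // k=3: one cluster over-split
    // WSS falls 404.0 -> 4.0 -> 3.0: a huge drop up to k=2, then almost none,
    // so the elbow picks k=2 for this data.
    println(s"k=1: $k1, k=2: $k2, k=3: $k3")
  }
}
```

Going from 1 to 2 centers removes almost all of the error; going from 2 to 3 barely helps, which is exactly the "sudden flattening" the elbow rule looks for.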
Another evaluation metric for clustering is the silhouette coefficient, which combines the cohesion and the separation of the clustering to assess its quality:
1) Compute a_i, the average distance from sample i to the other samples in its own cluster. The smaller a_i, the lower the within-cluster dissimilarity of sample i, and the more strongly it belongs in that cluster.
2) Compute b_ij, the average distance from sample i to all samples of another cluster C_j, called the dissimilarity between sample i and C_j. The between-cluster dissimilarity of sample i is defined as b_i = min{b_i1, b_i2, ..., b_ik}; the larger b_i, the less sample i belongs to any other cluster.
3) Each sample's silhouette is s_i = (b_i - a_i) / max(a_i, b_i). Averaging s_i over all samples gives the mean silhouette coefficient, which ranges over [-1, 1]; the larger it is, the better the clustering: samples are close together within a cluster and far apart between clusters.
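The three steps above can be computed by hand in plain Scala. The sketch below (hypothetical toy clusters, not the iris data) evaluates the silhouette of a single point:

```scala
// Hand-computing the silhouette for one point of a hypothetical two-cluster toy set.
object SilhouetteSketch {
  type P = (Double, Double)

  def dist(a: P, b: P): Double =
    math.sqrt(math.pow(a._1 - b._1, 2) + math.pow(a._2 - b._2, 2))

  // s(i) = (b - a) / max(a, b), where
  //   a = mean distance from i to the other points of its own cluster (cohesion)
  //   b = smallest mean distance from i to the points of any other cluster (separation)
  def silhouette(i: P, own: Seq[P], others: Seq[Seq[P]]): Double = {
    val a = own.filterNot(_ == i).map(dist(i, _)).sum / (own.size - 1)
    val b = others.map(c => c.map(dist(i, _)).sum / c.size).min
    (b - a) / math.max(a, b)
  }

  def main(args: Array[String]): Unit = {
    val c0: Seq[P] = Seq((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
    val c1: Seq[P] = Seq((10.0, 10.0), (10.0, 11.0), (11.0, 10.0))
    val s = silhouette((0.0, 0.0), c0, Seq(c1))
    println(f"s = $s%.3f") // ~0.93: the point sits deep inside its own cluster
  }
}
```

Because the point's own cluster is tight (a = 1) and the other cluster is far away (b ≈ 14.6), its silhouette is close to 1; a point lying between the two clusters would score near 0 or below.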
The following example clusters the iris dataset with the KMeans algorithm, uses the elbow method to choose K, and evaluates the model with the silhouette coefficient.
Preparation:
A: data file iris_kmeans.txt (libsvm format, one record per line)
1 1:5.1 2:3.5 3:1.4 4:0.2
1 1:4.9 2:3.0 3:1.4 4:0.2
1 1:4.7 2:3.2 3:1.3 4:0.2
1 1:4.6 2:3.1 3:1.5 4:0.2
1 1:5.0 2:3.6 3:1.4 4:0.2
1 1:5.4 2:3.9 3:1.7 4:0.4
1 1:4.6 2:3.4 3:1.4 4:0.3
1 1:5.0 2:3.4 3:1.5 4:0.2
1 1:4.4 2:2.9 3:1.4 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:5.4 2:3.7 3:1.5 4:0.2
1 1:4.8 2:3.4 3:1.6 4:0.2
1 1:4.8 2:3.0 3:1.4 4:0.1
1 1:4.3 2:3.0 3:1.1 4:0.1
1 1:5.8 2:4.0 3:1.2 4:0.2
1 1:5.7 2:4.4 3:1.5 4:0.4
1 1:5.4 2:3.9 3:1.3 4:0.4
1 1:5.1 2:3.5 3:1.4 4:0.3
1 1:5.7 2:3.8 3:1.7 4:0.3
1 1:5.1 2:3.8 3:1.5 4:0.3
1 1:5.4 2:3.4 3:1.7 4:0.2
1 1:5.1 2:3.7 3:1.5 4:0.4
1 1:4.6 2:3.6 3:1.0 4:0.2
1 1:5.1 2:3.3 3:1.7 4:0.5
1 1:4.8 2:3.4 3:1.9 4:0.2
1 1:5.0 2:3.0 3:1.6 4:0.2
1 1:5.0 2:3.4 3:1.6 4:0.4
1 1:5.2 2:3.5 3:1.5 4:0.2
1 1:5.2 2:3.4 3:1.4 4:0.2
1 1:4.7 2:3.2 3:1.6 4:0.2
1 1:4.8 2:3.1 3:1.6 4:0.2
1 1:5.4 2:3.4 3:1.5 4:0.4
1 1:5.2 2:4.1 3:1.5 4:0.1
1 1:5.5 2:4.2 3:1.4 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:5.0 2:3.2 3:1.2 4:0.2
1 1:5.5 2:3.5 3:1.3 4:0.2
1 1:4.9 2:3.1 3:1.5 4:0.1
1 1:4.4 2:3.0 3:1.3 4:0.2
1 1:5.1 2:3.4 3:1.5 4:0.2
1 1:5.0 2:3.5 3:1.3 4:0.3
1 1:4.5 2:2.3 3:1.3 4:0.3
1 1:4.4 2:3.2 3:1.3 4:0.2
1 1:5.0 2:3.5 3:1.6 4:0.6
1 1:5.1 2:3.8 3:1.9 4:0.4
1 1:4.8 2:3.0 3:1.4 4:0.3
1 1:5.1 2:3.8 3:1.6 4:0.2
1 1:4.6 2:3.2 3:1.4 4:0.2
1 1:5.3 2:3.7 3:1.5 4:0.2
1 1:5.0 2:3.3 3:1.4 4:0.2
2 1:7.0 2:3.2 3:4.7 4:1.4
2 1:6.4 2:3.2 3:4.5 4:1.5
2 1:6.9 2:3.1 3:4.9 4:1.5
2 1:5.5 2:2.3 3:4.0 4:1.3
2 1:6.5 2:2.8 3:4.6 4:1.5
2 1:5.7 2:2.8 3:4.5 4:1.3
2 1:6.3 2:3.3 3:4.7 4:1.6
2 1:4.9 2:2.4 3:3.3 4:1.0
2 1:6.6 2:2.9 3:4.6 4:1.3
2 1:5.2 2:2.7 3:3.9 4:1.4
2 1:5.0 2:2.0 3:3.5 4:1.0
2 1:5.9 2:3.0 3:4.2 4:1.5
2 1:6.0 2:2.2 3:4.0 4:1.0
2 1:6.1 2:2.9 3:4.7 4:1.4
2 1:5.6 2:2.9 3:3.6 4:1.3
2 1:6.7 2:3.1 3:4.4 4:1.4
2 1:5.6 2:3.0 3:4.5 4:1.5
2 1:5.8 2:2.7 3:4.1 4:1.0
2 1:6.2 2:2.2 3:4.5 4:1.5
2 1:5.6 2:2.5 3:3.9 4:1.1
2 1:5.9 2:3.2 3:4.8 4:1.8
2 1:6.1 2:2.8 3:4.0 4:1.3
2 1:6.3 2:2.5 3:4.9 4:1.5
2 1:6.1 2:2.8 3:4.7 4:1.2
2 1:6.4 2:2.9 3:4.3 4:1.3
2 1:6.6 2:3.0 3:4.4 4:1.4
2 1:6.8 2:2.8 3:4.8 4:1.4
2 1:6.7 2:3.0 3:5.0 4:1.7
2 1:6.0 2:2.9 3:4.5 4:1.5
2 1:5.7 2:2.6 3:3.5 4:1.0
2 1:5.5 2:2.4 3:3.8 4:1.1
2 1:5.5 2:2.4 3:3.7 4:1.0
2 1:5.8 2:2.7 3:3.9 4:1.2
2 1:6.0 2:2.7 3:5.1 4:1.6
2 1:5.4 2:3.0 3:4.5 4:1.5
2 1:6.0 2:3.4 3:4.5 4:1.6
2 1:6.7 2:3.1 3:4.7 4:1.5
2 1:6.3 2:2.3 3:4.4 4:1.3
2 1:5.6 2:3.0 3:4.1 4:1.3
2 1:5.5 2:2.5 3:4.0 4:1.3
2 1:5.5 2:2.6 3:4.4 4:1.2
2 1:6.1 2:3.0 3:4.6 4:1.4
2 1:5.8 2:2.6 3:4.0 4:1.2
2 1:5.0 2:2.3 3:3.3 4:1.0
2 1:5.6 2:2.7 3:4.2 4:1.3
2 1:5.7 2:3.0 3:4.2 4:1.2
2 1:5.7 2:2.9 3:4.2 4:1.3
2 1:6.2 2:2.9 3:4.3 4:1.3
2 1:5.1 2:2.5 3:3.0 4:1.1
2 1:5.7 2:2.8 3:4.1 4:1.3
3 1:6.3 2:3.3 3:6.0 4:2.5
3 1:5.8 2:2.7 3:5.1 4:1.9
3 1:7.1 2:3.0 3:5.9 4:2.1
3 1:6.3 2:2.9 3:5.6 4:1.8
3 1:6.5 2:3.0 3:5.8 4:2.2
3 1:7.6 2:3.0 3:6.6 4:2.1
3 1:4.9 2:2.5 3:4.5 4:1.7
3 1:7.3 2:2.9 3:6.3 4:1.8
3 1:6.7 2:2.5 3:5.8 4:1.8
3 1:7.2 2:3.6 3:6.1 4:2.5
3 1:6.5 2:3.2 3:5.1 4:2.0
3 1:6.4 2:2.7 3:5.3 4:1.9
3 1:6.8 2:3.0 3:5.5 4:2.1
3 1:5.7 2:2.5 3:5.0 4:2.0
3 1:5.8 2:2.8 3:5.1 4:2.4
3 1:6.4 2:3.2 3:5.3 4:2.3
3 1:6.5 2:3.0 3:5.5 4:1.8
3 1:7.7 2:3.8 3:6.7 4:2.2
3 1:7.7 2:2.6 3:6.9 4:2.3
3 1:6.0 2:2.2 3:5.0 4:1.5
3 1:6.9 2:3.2 3:5.7 4:2.3
3 1:5.6 2:2.8 3:4.9 4:2.0
3 1:7.7 2:2.8 3:6.7 4:2.0
3 1:6.3 2:2.7 3:4.9 4:1.8
3 1:6.7 2:3.3 3:5.7 4:2.1
3 1:7.2 2:3.2 3:6.0 4:1.8
3 1:6.2 2:2.8 3:4.8 4:1.8
3 1:6.1 2:3.0 3:4.9 4:1.8
3 1:6.4 2:2.8 3:5.6 4:2.1
3 1:7.2 2:3.0 3:5.8 4:1.6
3 1:7.4 2:2.8 3:6.1 4:1.9
3 1:7.9 2:3.8 3:6.4 4:2.0
3 1:6.4 2:2.8 3:5.6 4:2.2
3 1:6.3 2:2.8 3:5.1 4:1.5
3 1:6.1 2:2.6 3:5.6 4:1.4
3 1:7.7 2:3.0 3:6.1 4:2.3
3 1:6.3 2:3.4 3:5.6 4:2.4
3 1:6.4 2:3.1 3:5.5 4:1.8
3 1:6.0 2:3.0 3:4.8 4:1.8
3 1:6.9 2:3.1 3:5.4 4:2.1
3 1:6.7 2:3.1 3:5.6 4:2.4
3 1:6.9 2:3.1 3:5.1 4:2.3
3 1:5.8 2:2.7 3:5.1 4:1.9
3 1:6.8 2:3.2 3:5.9 4:2.3
3 1:6.7 2:3.3 3:5.7 4:2.5
3 1:6.7 2:3.0 3:5.2 4:2.3
3 1:6.3 2:2.5 3:5.0 4:1.9
3 1:6.5 2:3.0 3:5.2 4:2.0
3 1:6.2 2:3.4 3:5.4 4:2.3
3 1:5.9 2:3.0 3:5.1 4:1.8
B: Maven dependencies (this is the whole project pom, not trimmed down to just this example; take what you need)
<repositories>
    <repository>
        <id>ali-repo</id>
        <name>ali-repo</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <layout>default</layout>
    </repository>
    <repository>
        <id>mvn-repo</id>
        <name>mvn-repo</name>
        <url>https://mvnrepository.com</url>
    </repository>
    <repository>
        <id>cdh-repo</id>
        <name>cdh-repo</name>
        <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
    <repository>
        <id>hdp-repo</id>
        <name>hdp-repo</name>
        <url>http://repo.hortonworks.com/content/repositories/releases/</url>
    </repository>
</repositories>

<properties>
    <java.version>1.8</java.version>
    <!-- project compiler -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.encoding>UTF-8</maven.compiler.encoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <maven.build.timestamp.format>yyyyMMddHHmmss</maven.build.timestamp.format>

    <scala.version>2.11.8</scala.version>
    <hadoop.version>2.6.0-cdh5.14.0</hadoop.version>
    <spark.version>2.2.0</spark.version>
    <hive.version>1.1.0-cdh5.14.0</hive.version>
    <oozie.version>4.1.0-cdh5.14.0</oozie.version>
    <hbase.version>1.2.0-cdh5.14.0</hbase.version>
    <solr.version>4.10.3-cdh5.14.0</solr.version>
    <jsch.version>0.1.53</jsch.version>
    <jackson.spark.version>2.6.5</jackson.spark.version>
    <mysql.version>5.1.46</mysql.version>

    <!-- maven plugins -->
    <mybatis-generator-maven-plugin.version>1.3.5</mybatis-generator-maven-plugin.version>
    <maven-surefire-plugin.version>2.19.1</maven-surefire-plugin.version>
    <maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
    <wagon-ssh.version>3.1.0</wagon-ssh.version>
    <wagon-maven-plugin.version>2.0.0</wagon-maven-plugin.version>
    <maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
    <maven-war-plugin.version>3.2.1</maven-war-plugin.version>
    <jetty-maven-plugin.version>9.4.10.v20180503</jetty-maven-plugin.version>
</properties>

<dependencies>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <!-- jackson -->
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>${jackson.spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-annotations</artifactId>
        <version>${jackson.spark.version}</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-core</artifactId>
        <version>${jackson.spark.version}</version>
    </dependency>
    <!-- spark -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>${spark.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.scalanlp</groupId>
                <artifactId>breeze_2.11</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.scalanlp</groupId>
        <artifactId>breeze_2.11</artifactId>
        <version>0.13</version>
        <exclusions>
            <exclusion>
                <groupId>org.scala-lang</groupId>
                <artifactId>scala-library</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- hadoop -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-util</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-core-asl</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-mapper-asl</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-sslengine</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-xc</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <!-- hbase -->
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>${hbase.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-common</artifactId>
        <version>${hbase.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-util</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-server</artifactId>
        <version>${hbase.version}</version>
        <exclusions>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>servlet-api-2.5</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-util-6.1.26.hwx</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-util</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.mortbay.jetty</groupId>
                <artifactId>jetty-sslengine</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive-thriftserver_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>

    <!-- solr -->
    <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr-core</artifactId>
        <version>${solr.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>solr-solrj</artifactId>
        <version>${solr.version}</version>
    </dependency>
    <!-- mysql -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>${mysql.version}</version>
    </dependency>

    <dependency>
        <groupId>com.typesafe</groupId>
        <artifactId>config</artifactId>
        <version>1.2.1</version>
    </dependency>
</dependencies>

<build>
    <outputDirectory>target/classes</outputDirectory>
    <testOutputDirectory>target/test-classes</testOutputDirectory>
    <resources>
        <resource>
            <directory>${project.basedir}/src/main/resources</directory>
        </resource>
    </resources>
    <!-- Maven compiler plugins -->
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>${maven-compiler-plugin.version}</version>
            <configuration>
                <source>${java.version}</source>
                <target>${java.version}</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
C: The code:
package ml

import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

import scala.collection.immutable

/**
 * @Author: sou1yu
 * @Email: sou1yu@aliyun.com
 */
object IrisClusterDemo {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName(this.getClass.getSimpleName.stripSuffix("$"))
      .master("local[3]")
      .config("spark.sql.shuffle.partitions", "2")
      .getOrCreate()

    import spark.implicits._

    // 1. Load the iris dataset (libsvm format)
    val irisDF: DataFrame = spark.read.format("libsvm")
      .option("numFeatures", 4)
      .load("datas/iris_kmeans.txt")

    /**
     * irisDF.printSchema()
     * irisDF.show(10, false)
     * root
     *  |-- label: double (nullable = true)
     *  |-- features: vector (nullable = true)
     *
     * +-----+-------------------------------+
     * |label|features                       |
     * +-----+-------------------------------+
     * |1.0  |(4,[0,1,2,3],[5.1,3.5,1.4,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.9,3.0,1.4,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.7,3.2,1.3,0.2])|
     * |1.0  |(4,[0,1,2,3],[4.6,3.1,1.5,0.2])|
     * |1.0  |(4,[0,1,2,3],[5.0,3.6,1.4,0.2])|
     */
    // 2. Try K values from 2 to 6 and use the elbow method to decide K
    val values: immutable.IndexedSeq[(Int, KMeansModel, String, Double)] = (2 to 6).map { k =>
      // a. Build a KMeans estimator
      val kMeans = new KMeans()
        // input feature column and output prediction column
        .setFeaturesCol("features")
        .setPredictionCol("prediction")
        // number of clusters for this run
        .setK(k)
        // maximum number of iterations
        .setMaxIter(50)
        // init mode; "k-means||" is already the default (a parallel variant of k-means++
        // that oversamples candidate centers, then keeps K of them)
        .setInitMode("k-means||")
        // distance measure between points and cluster centers:
        // "euclidean" (default) or "cosine"
        // .setDistanceMeasure("euclidean")
        .setDistanceMeasure("cosine")

      // b. Fit the model (transformer) on the dataset
      val kmeansModel: KMeansModel = kMeans.fit(irisDF)
      // c. Predict cluster assignments
      val predictionDF: DataFrame = kmeansModel.transform(irisDF)

      // Count the number of points in each cluster
      val clusterNumber: String = predictionDF.groupBy($"prediction").count()
        .select($"prediction", $"count")
        .as[(Int, Long)]
        .rdd
        .collectAsMap()
        .toMap
        .mkString(",")

      // d. Evaluate the clustering
      val evaluator: ClusteringEvaluator = new ClusteringEvaluator()
        .setPredictionCol("prediction")
        // silhouette coefficient
        .setMetricName("silhouette")
        // evaluate with squared Euclidean distance (the API default) or cosine distance
        // .setDistanceMeasure("squaredEuclidean")
        .setDistanceMeasure("cosine")

      /* The silhouette coefficient combines cohesion (how tightly the points of a
         cluster gather around its center) and separation (how far apart the clusters
         are). Its mean ranges over [-1, 1]; the closer to 1 the better. Also check
         that the cluster sizes stay reasonably balanced. */
      val scValue: Double = evaluator.evaluate(predictionDF)

      // e. Return a 4-tuple of results
      (k, kmeansModel, clusterNumber, scValue)
    }

    // Print the metrics for each K
    values.foreach(println)

    // Done; release resources
    spark.stop()
  }
}
D: Results with Euclidean distance and with cosine distance, respectively
Euclidean distance:
(K, model, points per cluster, silhouette)
(2,kmeans_33af8f322a80,1 -> 97,0 -> 53,0.8501515983265806)
(3,kmeans_dddad8bd3858,2 -> 39,1 -> 50,0 -> 61,0.7342113066202725)
(4,kmeans_251d99eaeae4,2 -> 28,1 -> 50,3 -> 43,0 -> 29,0.6748661728223084)
(5,kmeans_5a9a066aaa9a,0 -> 23,1 -> 33,2 -> 30,3 -> 47,4 -> 17,0.5593200358940349)
(6,kmeans_734c87051c61,0 -> 30,5 -> 18,1 -> 19,2 -> 47,3 -> 23,4 -> 13,0.5157126401818913)
Cosine distance:
(K, model, points per cluster, silhouette)
(2,kmeans_99c4cabaa950,1 -> 50,0 -> 100,0.9579554849242657)
(3,kmeans_73251a945156,2 -> 46,1 -> 50,0 -> 54,0.7484647230660575)
(4,kmeans_5f8bce0297d5,2 -> 46,1 -> 19,3 -> 31,0 -> 54,0.5754341193280768)
(5,kmeans_92f07728d30f,0 -> 27,1 -> 50,2 -> 23,3 -> 28,4 -> 22,0.6430770644178772)
(6,kmeans_acbd159f5a1e,0 -> 24,5 -> 21,1 -> 29,2 -> 43,3 -> 15,4 -> 18,0.4512255960897416)
Takeaway: from the numbers above, setting the number of clusters K to 3 is the best fit.
Cosine distance uses the cosine of the angle between two vectors to measure how different two items are. Compared with Euclidean distance, it emphasizes the difference in direction between the vectors rather than their magnitudes: picture two vectors in a 3-D coordinate system, where Euclidean distance is the straight-line gap between their endpoints, while cosine distance depends only on the angle between them.
Summary: it pays to keep this distinction in mind in practice. Although cosine distance is not a distance metric in the strict sense, it is very useful for describing the relationship between two feature vectors, for example in face recognition and recommender systems.
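The contrast is easy to see in plain Scala. The sketch below uses two made-up vectors that point in the same direction but differ 10x in magnitude: Euclidean distance calls them far apart, cosine distance calls them identical.

```scala
// Contrasting Euclidean and cosine distance on two vectors that point the
// same way but differ in magnitude (hypothetical values).
object DistanceSketch {
  def dot(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // cosine *distance* = 1 - cosine similarity; 0 means identical direction
  def cosineDist(a: Seq[Double], b: Seq[Double]): Double =
    1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

  def main(args: Array[String]): Unit = {
    val a = Seq(1.0, 2.0)
    val b = Seq(10.0, 20.0)   // same direction, 10x the magnitude
    println(euclidean(a, b))  // ~20.1: far apart in absolute position
    println(cosineDist(a, b)) // ~0.0: indistinguishable by direction
  }
}
```

This is why cosine distance suits tasks where the *pattern* of a feature vector matters more than its scale, which is also why the iris run above was tried with both measures.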