Distance function metrics: for example, the Euclidean distance (the shortest straight-line distance) and the Manhattan distance (akin to a route along city blocks). The Euclidean distance formula:

dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

where p and q are the cases being compared and n is the number of features.
Suppose we already know the class and the features (crunchiness and sweetness) of foods such as grape, green bean, nuts, and orange, and we want to know which class a tomato with known features (sweetness = 6, crunchiness = 4) belongs to. We compute the Euclidean distance between the tomato and each of its neighbors:
(figure: table of Euclidean distance calculations between the tomato and each food)
If K = 1, the tomato is closest to the orange, so it is classified as a fruit.
If K = 3, the three nearest neighbors are the orange, the grape, and the nuts. These three vote, and since 2 of 3 are fruits, the tomato is again classified as a fruit.
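To make the vote concrete, here is a minimal R sketch of the tomato example. The sweetness/crunchiness values assigned to the four neighbors below are illustrative assumptions (only the tomato's features are given above):

# assumed feature values for the four known foods (illustration only)
foods <- data.frame(
  name        = c("grape", "green bean", "nuts", "orange"),
  class       = c("fruit", "vegetable", "protein", "fruit"),
  sweetness   = c(8, 3, 3, 7),
  crunchiness = c(5, 7, 6, 3)
)
tomato <- c(sweetness = 6, crunchiness = 4)

# Euclidean distance from the tomato to each food
foods$dist <- sqrt((foods$sweetness - tomato["sweetness"])^2 +
                   (foods$crunchiness - tomato["crunchiness"])^2)
foods[order(foods$dist), ]  # the orange comes out as the nearest neighbor

# K = 3 vote: majority class among the three nearest neighbors
k3 <- head(foods[order(foods$dist), ], 3)
names(which.max(table(k3$class)))  # "fruit"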
Bias-variance tradeoff: the balance between overfitting and underfitting. A large K reduces the influence of noisy data on the model, but if K is too large the model will almost always predict the majority class (nearly every training case gets a vote) rather than relying on the nearest neighbors. A small K yields a more complex decision boundary that can fit the training data more finely, but if K is too small, noisy data or outliers (e.g., mislabeled cases) can unduly influence a case's classification.
In practice, the choice of K depends on the difficulty of the concept being learned and the number of cases in the training set. Typically, K falls between 3 and 10.
Two common rescaling formulas for kNN features:

Min-max normalization: (x - min(x)) / (max(x) - min(x))
z-score standardization: (x - mean(x)) / sd(x)
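A quick illustration of the two formulas on a toy vector:

x <- c(1, 2, 3, 4, 5)
(x - min(x)) / (max(x) - min(x))  # min-max normalization: rescales to [0, 1]
(x - mean(x)) / sd(x)             # z-score standardization: mean 0, sd 1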
Data: 569 cell-biopsy cases, each with 32 attributes (including an ID number and a cancer diagnosis: benign B / malignant M, plus 30 numeric measurements). Can the kNN algorithm identify whether a tumor is malignant or benign?
Source 1: http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/
Source 2: download wisc_bc_data.csv from the link below.
Link: https://pan.baidu.com/s/1Kdj6T8mp7YKraRLxEg3u1g (extraction code: 9auq)
Examine the data, and note that the ID feature should be dropped: a unique identifier carries no useful signal for prediction.
Ideally, both the training set and the test set should be representative subsets of the full dataset (the data should be put in random order beforehand).
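A minimal sketch of shuffling the rows before splitting (it assumes the wbcd data frame loaded in Step 2 below; the seed value is arbitrary):

set.seed(123)                       # arbitrary seed so the shuffle is repeatable
wbcd <- wbcd[sample(nrow(wbcd)), ]  # put the rows in random order before splitting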
## Example: Classifying Cancer Samples ----

## Step 2: Exploring and preparing the data ----

# import the CSV file
wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)

# examine the structure of the wbcd data frame
str(wbcd)

# drop the id feature
wbcd <- wbcd[-1]

# table of diagnosis
table(wbcd$diagnosis)

# recode diagnosis as a factor
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c("Benign", "Malignant"))

# table of proportions with more informative labels
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)

# summarize three numeric features
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])

# create normalization function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

# test normalization function - results should be identical
normalize(c(1, 2, 3, 4, 5))
normalize(c(10, 20, 30, 40, 50))

# normalize the wbcd data
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

# confirm that normalization worked
summary(wbcd_n$area_mean)

# create training and test data
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]

# create labels for training and test data
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels  <- wbcd[470:569, 1]
K is best chosen to be an odd number, which reduces the chance of a tie vote between the classes (as could happen in the tomato example with K = 2).
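One common starting point, used in Step 3 below, sets k near the square root of the number of training cases. A minimal sketch combining that rule of thumb with the odd-k preference (choose_k is a hypothetical helper, not part of any package):

# hypothetical helper: square-root rule of thumb, nudged to an odd value
choose_k <- function(n_train) {
  k <- floor(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1  # prefer an odd k to reduce tie votes
  k
}
choose_k(469)  # 21, the value used in Step 3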
## Step 3: Training a model on the data ----

# load the "class" library
library(class)

wbcd_test_pred <- knn(train = wbcd_train,
                      test = wbcd_test,
                      cl = wbcd_train_labels,
                      k = 21)  # roughly the square root of the 469 training cases: floor(sqrt(469)) = 21
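knn() returns a factor of predicted labels, one per test case; a quick sanity check:

# the prediction is a factor with one label per test case
head(wbcd_test_pred)
table(wbcd_test_pred)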
Next, we evaluate how well the predicted classes match the known values in the test labels. Prediction involves a balance between the false-positive (FP) rate and the false-negative (FN) rate.
In breast-cancer classification, a false negative (judging a malignant mass to be benign) carries a far higher cost than a false positive.
## Step 4: Evaluating model performance ----

# load the "gmodels" library
library(gmodels)

# create the cross tabulation of predicted vs. actual
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
(CrossTable output: 98 of the 100 test cases correctly classified, with 2 false negatives and 0 false positives)
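The same comparison can also be summarized numerically; a minimal sketch of extracting the headline accuracy from a plain confusion matrix:

# confusion matrix of actual vs. predicted labels
conf <- table(actual = wbcd_test_labels, predicted = wbcd_test_pred)
conf
sum(diag(conf)) / sum(conf)  # overall accuracy on the 100 test cases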
① Try replacing min-max normalization with z-score standardization
## Step 5: Improving model performance ----

# use the scale() function to z-score standardize a data frame
wbcd_z <- as.data.frame(scale(wbcd[-1]))

# confirm that the transformation was applied correctly
summary(wbcd_z$area_mean)

# create training and test datasets
wbcd_train <- wbcd_z[1:469, ]
wbcd_test  <- wbcd_z[470:569, ]

# re-classify test cases
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)

# create the cross tabulation of predicted vs. actual
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
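A note on the design choice: scale() operates on an entire data frame or matrix at once and returns a matrix, which is why the result is wrapped in as.data.frame(); unlike the hand-written normalize(), it does not need lapply().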
(CrossTable output: 95 of the 100 test cases correctly classified, with 5 false negatives)
Correct classifications drop from 98% to 95%, and false negatives rise from 2% to 5%, so the z-score version actually performs worse.
② Try different values of K
# try several different values of k
wbcd_train <- wbcd_n[1:469, ]
wbcd_test  <- wbcd_n[470:569, ]

for (k_val in c(1, 5, 11, 15, 21, 27)) {
  wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k = k_val)
  CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)
}
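To compare the candidate k values at a glance, the error counts can also be tabulated instead of reading six full cross tables; a minimal sketch (the column names are my own):

# summarize false negatives, false positives, and accuracy for each k
ks <- c(1, 5, 11, 15, 21, 27)
results <- t(sapply(ks, function(k_val) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k_val)
  c(k = k_val,
    false_neg = sum(pred == "Benign" & wbcd_test_labels == "Malignant"),
    false_pos = sum(pred == "Malignant" & wbcd_test_labels == "Benign"),
    accuracy  = mean(pred == wbcd_test_labels))
}))
results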
In the results above, K = 1 gives the lowest false-negative rate, but at the cost of more false positives. Also beware of tweaking the method just to predict the test set too precisely.
Although the kNN algorithm is simple, it can handle remarkably complex tasks.