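I. Data description
Insurance charges dataset: insurance.csv. The variables are age, sex, bmi (body mass index), children (number of children), smoker (whether the person smokes), region, and charges (insurance charges).
Goal: predict the insurance charges, charges.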
II. Load the analysis packages and prepare the data
library(dplyr)
library(caret)
library(MASS)
library(glmnet)
insurance<-read.csv('f:/桌面/insurance.csv',colClasses = c('numeric','character',rep('numeric',2),
rep('character',2),'numeric'))
head(insurance)
Output:
> head(insurance)
  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
1. Log-transform the response variable charges
insurance<-insurance %>% mutate(log_charges=log(charges))
2. Randomly split the data into a learning set, insurance_learning (70%), and a test set, insurance_test (30%)
id_insurance<-sample(1:nrow(insurance),round(0.7*nrow(insurance)))
insurance_learning<-insurance[id_insurance,]
insurance_test<-insurance[-id_insurance,]
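Because sample() draws a different split on every run, the numbers reported in the rest of this post will not reproduce exactly. A minimal sketch of a reproducible 70/30 split follows. It uses a hypothetical stand-in data frame (toy) with 1338 rows, the size implied by the 937 learning rows in the regression output, and the seed value 123 is an arbitrary assumption.

```r
# Hypothetical stand-in for the insurance data: only the row count matters here.
toy <- data.frame(id = 1:1338)

# Fix the random seed so the same rows are drawn on every run (123 is arbitrary).
set.seed(123)

# Draw 70% of the row indices without replacement for the learning set.
id_learning <- sample(seq_len(nrow(toy)), round(0.7 * nrow(toy)))

# The remaining rows form the test set.
toy_learning <- toy[id_learning, , drop = FALSE]
toy_test <- toy[-id_learning, , drop = FALSE]

# Sizes of the two sets.
c(nrow(toy_learning), nrow(toy_test))
```

Calling set.seed() with the same value before sample() gives the same split again, which makes the RMSE values of different models comparable across runs.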
III. Ordinary linear regression
1. Fit the linear model on the learning set
fit_lm<-lm(log_charges~age+sex+bmi+children+smoker+region,data=insurance_learning)
summary(fit_lm)
Running this gives the coefficients of the regression model:
> summary(fit_lm)

Call:
lm(formula = log_charges ~ age + sex + bmi + children + smoker + 
    region, data = insurance_learning)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08751 -0.20558 -0.04740  0.07225  2.12026 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.938432   0.085712  80.951  < 2e-16 ***
age              0.035413   0.001024  34.568  < 2e-16 ***
sexmale         -0.068849   0.028469  -2.418  0.01578 *  
bmi              0.015198   0.002477   6.136 1.25e-09 ***
children         0.099796   0.011731   8.507  < 2e-16 ***
smokeryes        1.580321   0.035125  44.992  < 2e-16 ***
regionnorthwest -0.091624   0.040773  -2.247  0.02486 *  
regionsoutheast -0.167820   0.041432  -4.050 5.54e-05 ***
regionsouthwest -0.118894   0.040925  -2.905  0.00376 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4337 on 928 degrees of freedom
Multiple R-squared:  0.7815, Adjusted R-squared:  0.7797
F-statistic:   415 on 8 and 928 DF,  p-value: < 2.2e-16
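2. Apply the regression model to the test set
fit_pred<-predict(fit_lm,newdata=insurance_test)
fit_pred
Note that predict() takes the new observations through newdata; a data= argument is silently ignored and the fitted values for the learning set are returned instead.
3. Compute the root mean squared error (RMSE) of the predicted insurance charges, back-transforming the log-scale predictions with exp()
rmse.lm<-sqrt(mean((exp(fit_pred)-insurance_test$charges)^2))
rmse.lm
The original run reported:
[1] 17810.45
That run called predict(fit_lm,data=insurance_test) and used fit_pred[1:401], so it compared fitted values for learning-set rows with test-set charges. The figure is therefore not the model's test RMSE and will differ when the corrected code above is run.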
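One thing worth checking in this step is the argument name passed to predict(). The following minimal sketch uses R's built-in mtcars data rather than the insurance data, and all object names in it are hypothetical. It illustrates that predict() for lm objects reads new observations from newdata, while a data= argument is silently ignored, so the fitted values of the training rows come back instead.

```r
# Fit a small linear model on the first 20 rows of the built-in mtcars data.
fit <- lm(mpg ~ wt, data = mtcars[1:20, ])

# Hold out the remaining 12 rows as "new" observations.
new_rows <- mtcars[21:32, ]

# 'data' is not an argument of predict() for lm objects: it is swallowed by '...'
# and the fitted values for the 20 training rows are returned.
p_wrong <- predict(fit, data = new_rows)

# 'newdata' is the correct argument: one prediction per held-out row.
p_right <- predict(fit, newdata = new_rows)

length(p_wrong)  # 20, the number of training rows
length(p_right)  # 12, the number of held-out rows
```

A quick length check like this one, comparing the number of predictions with the number of test rows, catches the problem before any error metric is computed.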
IV. Applying the lasso to the regression for variable selection
1. Convert the categorical predictors to dummy variables
dmy <- dummyVars(~sex+smoker+region,
insurance_learning,
fullRank = TRUE)
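2. Append the dummy variables and drop the columns that are no longer needed (sex, smoker, region, charges, and log_charges), applying the same steps to the learning and test sets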
insurance_learning2<-cbind(insurance_learning,predict(dmy,insurance_learning))
insurance_learning2<-insurance_learning2[,-c(2,5, 6,7,8)]
insurance_test2 <- cbind(insurance_test,predict(dmy,insurance_test))
insurance_test2 <- insurance_test2[,-c(2,5,6,7,8)]
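To make the dummy coding concrete, here is a minimal sketch on a hypothetical four-row data frame (toy). It uses base R's model.matrix() instead of caret's dummyVars(), which is a swapped-in technique for illustration only. For factor predictors it produces the same kind of full-rank 0/1 columns, with the first level of each factor as the baseline.

```r
# Hypothetical toy data with the three categorical predictors from the post.
toy <- data.frame(
  sex    = c("female", "male", "male", "female"),
  smoker = c("yes", "no", "no", "no"),
  region = c("southwest", "southeast", "northwest", "northeast"),
  stringsAsFactors = TRUE
)

# model.matrix() expands each factor into 0/1 columns, dropping the first level
# (female, no, northeast) as the baseline; [, -1] removes the intercept column.
dummies <- model.matrix(~ sex + smoker + region, data = toy)[, -1]

colnames(dummies)
dummies[1, ]  # first row: a female smoker from the southwest
```

The column names match the coefficient names in the model output (sexmale, smokeryes, regionnorthwest, regionsoutheast, regionsouthwest), and northeast, as the baseline region, has no column of its own.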
3. Fit the lasso with cross-validation
a. Lasso model
cvfit.lasso<-cv.glmnet(as.matrix(insurance_learning2),insurance_learning$log_charges,family='gaussian')
cvfit.lasso$lambda.min
This gives lambda.min, the value of lambda with the smallest cross-validated error:
[1] 0.001371769
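For reference, with family='gaussian' and glmnet's default alpha = 1, the problem solved for each value of lambda can be written as follows. The notation is an assumption introduced here: N is the number of learning rows, y_i the log charges, x_i the row of predictors, and beta_0, beta the intercept and coefficients.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

cv.glmnet() fits this over a grid of lambda values and reports as lambda.min the one with the lowest cross-validated mean squared error. A value as small as the one above means very little shrinkage is applied, so the coefficients can be expected to stay close to the ordinary least-squares estimates.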
The regression coefficients at lambda.min are:
coef(cvfit.lasso,s='lambda.min')
Output:
> coef(cvfit.lasso,s='lambda.min')
9 x 1 sparse Matrix of class "dgCMatrix"
                         s1
(Intercept)      7.06202397
age              0.03529214
bmi              0.01086552
children         0.09122787
sexmale         -0.05158430
smokeryes        1.53999595
regionnorthwest -0.03962524
regionsoutheast -0.10544760
regionsouthwest -0.09973860
These are the coefficients of the lasso regression equation.
b. Predict the response for the test data with the model at lambda.min
test.pred.lasso<-predict(cvfit.lasso,as.matrix(insurance_test2),s='lambda.min')
head(test.pred.lasso)
This gives the predicted values (on the log scale):
2  7.998407
5  8.413959
6  8.330311
16 7.939771
20 9.893014
22 8.46432
c. Compute the RMSE of the predicted insurance charges with the lasso model
rmse.rm<-sqrt(mean((exp(test.pred.lasso)-insurance_test$charges)^2))
rmse.rm
Output:
> rmse.rm
[1] 9284.007
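Written out, the quantity computed for rmse.rm is the RMSE on the original charges scale. The notation is an assumption introduced here: n is the number of test rows, \hat{y}_i the predicted log charges for row i, and charges_i the observed charges.

```latex
\mathrm{RMSE}
= \sqrt{\frac{1}{n}\sum_{i=1}^{n}
\bigl(\exp(\hat{y}_i) - \mathrm{charges}_i\bigr)^2}
```

The exp() undoes the log transform applied to the response, so the error is measured in the same units as charges rather than in log units.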
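Conclusion: the prediction RMSE reported for the lasso model (9284.007) is much lower than the one reported for the ordinary linear regression (17810.45). At lambda.min, however, the lasso keeps all eight predictors and its coefficients are close to the linear-regression estimates, so the two models should predict similarly; a gap this large more likely reflects how the linear-regression predictions were computed than an effect of variable selection.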