
Linear Regression and Lasso Variable Selection Models in R

1. Data description

Insurance charges dataset: insurance.csv. The variables are age, sex, bmi (body mass index), children (number of children), smoker, region, and charges (the insurance charges).

Goal: model and predict the insurance charges (charges).

2. Load the analysis packages and prepare the data

library(dplyr)    # data manipulation (mutate, %>%)
library(caret)    # dummyVars() for dummy-variable encoding
library(MASS)
library(glmnet)   # cv.glmnet() for the cross-validated lasso

insurance<-read.csv('f:/桌面/insurance.csv',
                    colClasses = c('numeric','character',rep('numeric',2),
                                   rep('character',2),'numeric'))

head(insurance)

Output:

  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622
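
Before modelling, it can also be worth confirming that the columns were read with the intended types; a quick check with base R:

str(insurance)               # age/bmi/children/charges numeric, sex/smoker/region character
summary(insurance$charges)   # rough range of the response before the log transform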

1) Log-transform the dependent variable charges

insurance<-insurance %>% mutate(log_charges=log(charges))

2) Randomly split the data into insurance_learning (70%) and insurance_test (30%) sets

id_insurance<-sample(1:nrow(insurance),round(0.7*nrow(insurance)))
insurance_learning<-insurance[id_insurance,]
insurance_test<-insurance[-id_insurance,]
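
The split above depends on the random number generator, so the exact results below will vary from run to run; fixing the seed first makes the split reproducible. A minimal sketch (the seed value 2023 is arbitrary):

set.seed(2023)   # any fixed value; makes the 70/30 split reproducible
id_insurance<-sample(1:nrow(insurance),round(0.7*nrow(insurance)))
insurance_learning<-insurance[id_insurance,]
insurance_test<-insurance[-id_insurance,]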

3. Ordinary linear regression

1) Fit the linear model on the learning set:

fit_lm<-lm(log_charges~age+sex+bmi+children+smoker+region,data=insurance_learning)
summary(fit_lm)

Running this gives the estimated regression coefficients:


Call:
lm(formula = log_charges ~ age + sex + bmi + children + smoker + 
    region, data = insurance_learning)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.08751 -0.20558 -0.04740  0.07225  2.12026 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.938432   0.085712  80.951  < 2e-16 ***
age              0.035413   0.001024  34.568  < 2e-16 ***
sexmale         -0.068849   0.028469  -2.418  0.01578 *  
bmi              0.015198   0.002477   6.136 1.25e-09 ***
children         0.099796   0.011731   8.507  < 2e-16 ***
smokeryes        1.580321   0.035125  44.992  < 2e-16 ***
regionnorthwest -0.091624   0.040773  -2.247  0.02486 *  
regionsoutheast -0.167820   0.041432  -4.050 5.54e-05 ***
regionsouthwest -0.118894   0.040925  -2.905  0.00376 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4337 on 928 degrees of freedom
Multiple R-squared:  0.7815,	Adjusted R-squared:  0.7797 
F-statistic:   415 on 8 and 928 DF,  p-value: < 2.2e-16
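
Because the response is log(charges), each coefficient acts multiplicatively on the original charges scale after exponentiation; for example, with the coefficients shown above:

exp(coef(fit_lm)['smokeryes'])   # multiplicative effect of smoking on charges, roughly exp(1.58) ≈ 4.9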

2) Apply the fitted model to the test set

fit_pred<-predict(fit_lm,newdata=insurance_test)   # use newdata= so predictions are made on the test set, not the training fits

fit_pred

3) Compute the root mean squared error (RMSE) of the predicted charges

rmse.lm<-sqrt(mean((exp(fit_pred)-insurance_test$charges)^2))   # back-transform the log-scale predictions with exp() before comparing
rmse.lm

Output:

[1] 17810.45
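
The same pattern, exponentiate the predicted log-charges and compare with the observed charges, is reused for the lasso model below, so a small helper keeps it in one place. A minimal sketch (rmse_exp is a hypothetical helper name, not part of the original code):

# RMSE on the original charges scale, given predictions on the log scale
rmse_exp <- function(pred_log, actual) sqrt(mean((exp(pred_log) - actual)^2))
rmse_exp(fit_pred, insurance_test$charges)   # should match rmse.lm above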

4. Lasso regression for variable selection

1) Convert the categorical predictors (sex, smoker, region) to dummy variables

dmy <- dummyVars(~sex+smoker+region,
                 insurance_learning, 
                 fullRank = TRUE)   # fullRank = TRUE drops one level per factor to avoid collinearity

2) Append the dummy variables and drop the redundant columns, doing the same for both the learning and test sets

insurance_learning2<-cbind(insurance_learning,predict(dmy,insurance_learning))
# drop sex, smoker, region (now encoded as dummies) and the response columns charges/log_charges
insurance_learning2<-insurance_learning2[,-c(2,5,6,7,8)]

insurance_test2 <- cbind(insurance_test,predict(dmy,insurance_test)) 
insurance_test2 <- insurance_test2[,-c(2,5,6,7,8)]
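
For reference, the same numeric design matrix can also be built without caret using base R's model.matrix; a minimal sketch (x_learning and x_test are hypothetical names):

# the intercept column is dropped; character predictors are expanded to dummies automatically
x_learning <- model.matrix(~ age + sex + bmi + children + smoker + region, insurance_learning)[, -1]
x_test     <- model.matrix(~ age + sex + bmi + children + smoker + region, insurance_test)[, -1]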

3) Fit the lasso model with cross-validation

a) Lasso model

cvfit.lasso<-cv.glmnet(as.matrix(insurance_learning2),insurance_learning$log_charges,family='gaussian')   # 10-fold CV by default
cvfit.lasso$lambda.min

This gives lambda.min, the value of lambda with the smallest cross-validation error:

[1] 0.001371769

The regression coefficients at lambda.min are obtained with

coef(cvfit.lasso,s='lambda.min')

Output:

9 x 1 sparse Matrix of class "dgCMatrix"
                         s1
(Intercept)      7.06202397
age              0.03529214
bmi              0.01086552
children         0.09122787
sexmale         -0.05158430
smokeryes        1.53999595
regionnorthwest -0.03962524
regionsoutheast -0.10544760
regionsouthwest -0.09973860

These are the coefficient estimates of the lasso regression model.
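
Besides lambda.min, cv.glmnet also reports lambda.1se, the largest lambda whose cross-validated error is within one standard error of the minimum, which usually gives a sparser model. A quick comparison (results depend on the random CV folds):

plot(cvfit.lasso)                     # cross-validation error curve across the lambda path
cvfit.lasso$lambda.1se                # more conservative choice than lambda.min
coef(cvfit.lasso, s = 'lambda.1se')   # typically sets more coefficients exactly to zero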

b) Predict the response for the test data using the model at lambda.min

test.pred.lasso<-predict(cvfit.lasso,as.matrix(insurance_test2),s='lambda.min')

head(test.pred.lasso) shows the first few predicted values (on the log scale):

2    7.998407
5    8.413959
6    8.330311
16   7.939771
20   9.893014
22   8.46432

c) Compute the RMSE of the lasso predictions of charges

rmse.rm<-sqrt(mean((exp(test.pred.lasso)-insurance_test$charges)^2))
rmse.rm 

Output:

[1] 9284.007

Conclusion: the regression model obtained with lasso variable selection achieves a substantially smaller prediction RMSE than the ordinary linear regression model.
