Principal component analysis (PCA) is a dimensionality-reduction method: it replaces a larger set of original variables with a small number of uncorrelated composite variables while losing as little of the original information as possible. As a rule, no more than five principal components are retained. The R function is princomp().
The longley data set covers 16 years and 7 variables. The correlation plot below suggests a multicollinearity problem.
- library(corrplot)
- corr<-cor(longley)
- corrplot(corr)
- library(car)
- vif(lm(Employed~.,data=longley))
- ----- output -----
- GNP.deflator GNP Unemployed Armed.Forces Population Year
- 135.53244 1788.51348 33.61889 3.58893 399.15102 758.98060
A common rule of thumb: 0 < VIF < 10 indicates no collinearity; 10 <= VIF < 100 indicates strong multicollinearity; VIF > 100 indicates very severe multicollinearity. The output above shows that the multicollinearity of GNP.deflator, GNP, Population and Year is very severe.
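The rule of thumb is easier to interpret once you see what a VIF is: regress one predictor on all the others and take 1/(1 - R^2). A minimal sketch for GNP, using base R only (vif() performs this computation for every term):

```r
# VIF of GNP: regress GNP on the remaining predictors (everything
# except the response Employed) and invert 1 - R^2.
r2 <- summary(lm(GNP ~ . - Employed, data = longley))$r.squared
1 / (1 - r2)   # agrees with the vif() value for GNP shown above
```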
To resolve the multicollinearity, we next perform principal component analysis. Remove the response variable (Employed, column 7) before the analysis:
- pr1<-princomp(longley[,-7],cor=TRUE)
- ----- output -----
- Standard deviations: the standard deviation of each component, i.e. the square root of its eigenvalue
- Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
- 2.14554820 1.08413122 0.45102702 0.12218125 0.05051797 0.01940897
- 6 variables and 16 observations.
- summary(pr1)
- ------- output -------
- Importance of components: each component's standard deviation, proportion of variance explained, and cumulative proportion
- Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
- Standard deviation 2.1455482 1.0841312 0.45102702 0.122181253 0.0505179747 1.940897e-02
- Proportion of Variance 0.7672295 0.1958901 0.03390423 0.002488043 0.0004253443 6.278469e-05
- Cumulative Proportion 0.7672295 0.9631196 0.99702383 0.999511871 0.9999372153 1.000000e+00
The usual practice is to keep components up to a cumulative contribution of about 85%. Here the first two components already reach 96%, i.e. they explain 96% of the information in the original variables, so we retain the first two and discard the last four, achieving the dimensionality reduction.
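The proportions in this table can be reproduced directly from the standard deviations printed earlier, since each component's variance share is sdev^2 / sum(sdev^2). A quick sketch:

```r
# Reproduce the Proportion of Variance / Cumulative Proportion rows
pr1  <- princomp(longley[, -7], cor = TRUE)
prop <- pr1$sdev^2 / sum(pr1$sdev^2)   # variance share of each component
round(cumsum(prop), 4)                 # first two components reach ~0.9631
```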
- loadings(pr1)
- ------- output --------
- Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
- GNP.deflator 0.462 0.149 0.793 0.338 0.135
- GNP 0.462 0.278 -0.122 -0.150 -0.818
- Unemployed 0.321 -0.596 -0.728 -0.107
- Armed.Forces 0.202 0.798 -0.562
- Population 0.462 0.196 -0.590 0.549 0.312
- Year 0.465 0.128 -0.750 0.450
-
- Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
- SS loadings 1.000 1.000 1.000 1.000 1.000 1.000
- Proportion Var 0.167 0.167 0.167 0.167 0.167 0.167
- Cumulative Var 0.167 0.333 0.500 0.667 0.833 1.000
From the loading matrix we obtain each principal component as a linear combination of the standardized original variables; reading the first column, Comp.1 = 0.462*GNP.deflator + 0.462*GNP + 0.321*Unemployed + 0.202*Armed.Forces + 0.462*Population + 0.465*Year (small loadings are suppressed in the printout).
- predict(pr1) # compute the component scores
- sort(predict(pr1)[,1],decreasing = T) # first-component scores in decreasing order
- ----- output -----
- Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 ...
- 1947 -3.59294203 -0.7761110 0.31804507 -0.169624216 0.009085721 ...
- 1948 -3.10924187 -0.8768853 0.66329350 0.130051755 0.063565251 ...
- 1949 -2.42014988 -1.5905016 -0.50961596 -0.009106066 0.005934717 ...
- ---------------
- 1962 1961 1960 1959 1958
- 3.45445683 3.17896368 2.43855615 2.00361238 1.87669233 ....
- screeplot(pr1,type="lines",main="pr1")
In the scree plot, once the line has flattened out close to the x-axis the remaining components add little further information, so the elbow confirms that keeping two components is reasonable. That said, the first three components have a cumulative contribution of 99.7%, so below we extract the first three, merge them into the longley data set as F1, F2 and F3, and regress on them:
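The code that actually builds F1–F3 is not shown above; a minimal sketch, taking the first three columns of the score matrix:

```r
pr1 <- princomp(longley[, -7], cor = TRUE)
scores <- predict(pr1)          # 16 x 6 matrix of component scores
longley$F1 <- scores[, 1]       # modifies a local copy of the data set
longley$F2 <- scores[, 2]
longley$F3 <- scores[, 3]
```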
- lm.pr1<-lm(Employed~F1+F2+F3,data=longley)
- summary(lm.pr1)
- ------- output --------
- Coefficients:
- Estimate Std. Error t value Pr(>|t|)
- (Intercept) 65.3170 0.1163 561.701 < 2e-16 ***
- F1 1.5154 0.0542 27.961 2.71e-12 ***
- F2 0.3794 0.1073 3.537 0.00409 **
- F3 1.8013 0.2578 6.987 1.46e-05 ***
- Multiple R-squared: 0.986, Adjusted R-squared: 0.9825
From the output, every coefficient passes its significance test and the Adjusted R-squared is 0.982. The fitted regression equation is Employed = 65.3170 + 1.5154*F1 + 0.3794*F2 + 1.8013*F3.
Next, recover the regression coefficients of the original variables:
- beta<-coef(lm.pr1) # length-4 vector: intercept plus three slopes
- A<-loadings(pr1)[,1:3] # 6 x 3 loading matrix
- x.mean<-pr1$center # means used to center the variables
- dim(as.matrix(x.mean)) # 6 x 1
- x.sd<-pr1$scale # standard deviations used to scale the variables
- coef<-A%*%beta[2:4]/x.sd # (6,3) %*% (3,1) -> (6,1)
- beta0<-beta[1]-x.mean%*%coef # scalar intercept
- c(beta0,coef)
- ------ output ------
- [1] -3.587128e+02 9.478789e-02 1.267421e-02 -1.161491e-02
- [5] -5.987296e-03 1.538621e-01 2.029575e-01
Rewritten in terms of the original variables, the regression equation is Employed = -358.71 + 0.0948*GNP.deflator + 0.0127*GNP - 0.0116*Unemployed - 0.0060*Armed.Forces + 0.1539*Population + 0.2030*Year.
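As a sanity check (an extra step, not in the original), the back-transformed coefficients should reproduce the fitted values of the principal-component regression exactly, since the transformation only rewrites the same linear model in the original coordinates:

```r
pr1 <- princomp(longley[, -7], cor = TRUE)
lm.pr1 <- lm(longley$Employed ~ predict(pr1)[, 1:3])
beta <- coef(lm.pr1)
b  <- loadings(pr1)[, 1:3] %*% beta[2:4] / pr1$scale  # original-scale slopes
b0 <- beta[1] - pr1$center %*% b                      # original-scale intercept
fit <- as.matrix(longley[, -7]) %*% b + as.numeric(b0)
all.equal(as.numeric(fit), as.numeric(fitted(lm.pr1)))   # should be TRUE
```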
Next, factor analysis. The ability.cov data set holds the covariance matrix of six ability-test variables measured on 112 subjects.
- data("ability.cov")
- class(ability.cov)
- --------- output ---------
- $cov
- general picture blocks maze reading vocab
- general 24.641 5.991 33.520 6.023 20.755 29.701
- picture 5.991 6.700 18.137 1.782 4.936 7.204
- blocks 33.520 18.137 149.831 19.424 31.430 50.753
- maze 6.023 1.782 19.424 12.711 4.757 9.075
- reading 20.755 4.936 31.430 4.757 52.604 66.762
- vocab 29.701 7.204 50.753 9.075 66.762 135.292
- $center
- [1] 0 0 0 0 0 0
- $n.obs
- [1] 112
- ---------------
- [1] "list"
Extract the covariance matrix and convert it into a correlation matrix with cov2cor():
- cova<-ability.cov$cov
- corr<-cov2cor(cova)
- -------- output --------
- general picture blocks maze reading vocab
- general 1.0000000 0.4662649 0.5516632 0.3403250 0.5764799 0.5144058
- picture 0.4662649 1.0000000 0.5724364 0.1930992 0.2629229 0.2392766
- blocks 0.5516632 0.5724364 1.0000000 0.4450901 0.3540252 0.3564715
- maze 0.3403250 0.1930992 0.4450901 1.0000000 0.1839645 0.2188370
- reading 0.5764799 0.2629229 0.3540252 0.1839645 1.0000000 0.7913779
- vocab 0.5144058 0.2392766 0.3564715 0.2188370 0.7913779 1.0000000
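cov2cor() simply divides each covariance by the product of the two corresponding standard deviations; the same computation by hand:

```r
cova <- ability.cov$cov
d <- sqrt(diag(cova))            # the six standard deviations
corr_manual <- cova / (d %o% d)  # r_ij = s_ij / (s_i * s_j)
all.equal(corr_manual, cov2cor(cova))   # should be TRUE
```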
Drawing the correlation plot shows that a multicollinearity problem exists.
- library(corrplot)
- corrplot(corr)
- (fa1<-factanal(covmat=corr,factors=2,rotation="none"))
- ------ output ------
- Uniquenesses: the specific (unique) variances
- general picture blocks maze reading vocab
- 0.455 0.589 0.218 0.769 0.052 0.334
- Loadings: the factor loading matrix
- Factor1 Factor2
- general 0.499 0.543
- picture 0.156 0.622
- blocks 0.206 0.860
- maze 0.109 0.468
- reading 0.956 0.182
- vocab 0.785 0.225
- Factor1 Factor2
- SS loadings 1.858 1.724
- Proportion Var 0.310 0.287
- Cumulative Var 0.310 0.597
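The uniquenesses and loadings are linked: for a correlation-matrix fit, each uniqueness is approximately 1 minus the communality, i.e. the row sum of squared loadings. A quick check:

```r
corr <- cov2cor(ability.cov$cov)
fa1 <- factanal(covmat = corr, factors = 2, rotation = "none")
L <- unclass(loadings(fa1))      # drop the "loadings" class for arithmetic
round(1 - rowSums(L^2), 3)       # approximately the Uniquenesses table above
```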
Next, extract the factors using an orthogonal and an oblique rotation:
- update(fa1,factors=2,rotation="varimax") # orthogonal rotation
- update(fa1,factors=2,rotation="promax") # oblique rotation
The two methods produce different factor loadings:
- ------- varimax (orthogonal) rotation -------
- Uniquenesses:
- general picture blocks maze reading vocab
- 0.455 0.589 0.218 0.769 0.052 0.334
- Loadings:
- Factor1 Factor2
- general 0.499 0.543 (i.e. general = 0.499*Factor1 + 0.543*Factor2)
- picture 0.156 0.622
- blocks 0.206 0.860
- maze 0.109 0.468
- reading 0.956 0.182
- vocab 0.785 0.225
- -------- promax (oblique) rotation --------
- Uniquenesses:
- general picture blocks maze reading vocab
- 0.455 0.589 0.218 0.769 0.052 0.334
- Loadings:
- Factor1 Factor2
- general 0.364 0.470
- picture 0.671
- blocks 0.932
- maze 0.508
- reading 1.023
- vocab 0.811
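Note that the Uniquenesses block is identical under both rotations: rotation only redistributes the common variance between the factors, without changing how much of each variable is explained. For the orthogonal case this can be verified with stats::varimax(), the same rotation factanal() applies by default; the promax loadings can exceed 1 (e.g. reading at 1.023) because oblique factors are correlated.

```r
corr <- cov2cor(ability.cov$cov)
fa1 <- factanal(covmat = corr, factors = 2, rotation = "none")
rot <- varimax(loadings(fa1))    # orthogonal rotation of the raw loadings
# Communalities (row sums of squared loadings) survive the rotation intact:
cbind(before = rowSums(unclass(loadings(fa1))^2),
      after  = rowSums(unclass(rot$loadings)^2))
```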