Logistic Regression Model

1. The Logistic regression model

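Logistic regression was proposed by the statistician David Cox (1958). In essence, it fits the data to a Logistic model in order to predict the probability that an event occurs. Because the dependent variable is binary (it can also have more than two categories), the model describes the probability that a given event does or does not occur.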

Let the dependent variable $y$ take values in $\{0,1\}$, and let $x_1,x_2,\dots,x_p$ be the explanatory variables of $y$. Logistic regression studies how $X=(x_1,x_2,\dots,x_p)$ affects $y$. Write
$$p = P(y=1\mid X),\qquad 1-p = P(y=0\mid X).$$
The ratio $p/(1-p)$ is called the odds. The expectation of the dependent variable is
$$E(y\mid X) = 1\cdot p+0\cdot(1-p)=p.$$
Following the linear-model approach, one would write
$$y = \beta_0+\beta_1x_1+\dots+\beta_px_p+\varepsilon,$$
where $\varepsilon$ is the disturbance term. Estimating this equation by OLS gives the linear probability model (LPM). However, because $y$ only takes the values 0 and 1, the disturbance can only equal $1-X'\beta$ or $-X'\beta$, so its distribution depends on $X$; this leads to heteroskedasticity and, when the linear form does not describe the true probability, to correlation between $\varepsilon$ and $X$ (endogeneity). In addition, the linear model cannot rule out fitted values with $y<0$ or $y>1$ when $X$ takes extreme values. One therefore introduces a link function such that
$$\begin{cases}
P(y=1\mid X)=\Lambda(X,\beta)\\
P(y=0\mid X)=1-\Lambda(X,\beta)
\end{cases}$$
where $\Lambda(\cdot)$ denotes the link function and $\beta$ the parameters. The link function can be the standard normal cumulative distribution function or the logistic distribution function: the former gives the Probit model, the latter the Logit model. Because the standard normal CDF has no closed-form expression, the logistic distribution function is commonly used, that is,
$$P(y=1\mid X)=p=\frac{\exp(X'\beta)}{1+\exp(X'\beta)}=\frac{\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}.$$

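The logistic density is symmetric about the origin, with mean 0 and variance $\pi^2/3$, and it has heavier tails than the standard normal. From the expression above, the log odds are
$$\ln(\text{Odds}) = \ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\dots+\beta_px_p.$$
This model shows that, holding everything else constant, a one-unit change in $x_i$ changes the log odds by $\beta_i$ units; it does not change the dependent variable itself by $\beta_i$ units.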
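
A short worked step, added here for illustration, makes this interpretation concrete. Exponentiating the log-odds equation and raising $x_i$ by one unit with the other variables held fixed gives

$$\left.\frac{p}{1-p}\right|_{x_i+1}\Bigg/\left.\frac{p}{1-p}\right|_{x_i}=\frac{\exp(\beta_0+\dots+\beta_i(x_i+1)+\dots+\beta_px_p)}{\exp(\beta_0+\dots+\beta_ix_i+\dots+\beta_px_p)}=\exp(\beta_i),$$

so $\exp(\beta_i)$ can be read as the odds ratio for a one-unit change in $x_i$. The effect on the probability itself is not constant:

$$\frac{\partial p}{\partial x_i}=\beta_i\,p\,(1-p),$$

which depends on the value of $X$ at which it is evaluated.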


2. Parameter estimation

Since $y$ follows a 0-1 (Bernoulli) distribution, its probability function can be written as
$$P(y) = p^y(1-p)^{1-y},\qquad y=0,1.$$
The likelihood function is
$$L= \prod P(y) = \prod p^y(1-p)^{1-y}.$$
Taking logs gives
$$\ln L=\sum\left[y\ln p+(1-y)\ln(1-p)\right]=\sum\left[y\ln\frac{p}{1-p}+\ln(1-p)\right].$$
Substituting the expression for $p$, and using $\ln(1-p)=-\ln\left[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)\right]$, gives
$$\ln L=\sum\left\{y(\beta_0+\beta_1x_1+\dots+\beta_px_p)-\ln\left[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)\right]\right\}.$$

The first-order conditions are
$$\frac{\partial\ln L}{\partial\beta_j} =0,\qquad j=0,1,\dots,p,$$
which yield the maximum likelihood estimators $\hat{\beta}_j\ (j=0,1,\dots,p)$. Substituting $\hat{\beta}_j$ back into $P(y=1\mid X)$ gives
$$P(y=1\mid X)=\frac{\exp(\hat\beta_0+\hat\beta_1x_1+\dots+\hat\beta_px_p)}{1+\exp(\hat\beta_0+\hat\beta_1x_1+\dots+\hat\beta_px_p)}.$$

Likewise,
$$P(y=0\mid X)=\frac{1}{1+\exp(\hat\beta_0+\hat\beta_1x_1+\dots+\hat\beta_px_p)}.$$
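
The first-order conditions can also be written out explicitly. The following is a short worked step added for illustration; the sums run over observations and $x_0=1$ for the intercept (an index convention assumed here, since the text leaves it implicit):

$$\frac{\partial \ln L}{\partial \beta_j}=\sum\left[y\,x_j-\frac{\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}\,x_j\right]=\sum (y-p)\,x_j=0,\qquad j=0,1,\dots,p.$$

These equations are nonlinear in $\beta$ and have no closed-form solution, so they are solved iteratively, typically by Newton-Raphson. Stata's logit prints these iterations as its iteration log, which the nolog option used in Section 3 suppresses.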


3. Software implementation

Using the womenwk dataset as an example, we build the following model:
$$\text{work}_{i}=\beta_{0}+\beta_{1}\,\text{age}_{i}+\beta_{2}\,\text{married}_{i}+\beta_{3}\,\text{children}_{i}+\beta_{4}\,\text{education}_{i}+\varepsilon_{i}$$
where work indicates whether the person is employed, age is age, married is marital status, children is the number of children, and education is years of education.

The Stata code is as follows:

*------------------------ Logistic regression --------------------

cd "D:\master\笔记\markdown笔记\计量经济学\二值选择模型"

use womenwk.dta,clear
* Variable definitions:
* dataset: womenwk
* work: employed or not
* age: age
* married: marital status
* children: number of children
* education: years of education
*--------------------------- LPM estimation ----------------------
reg work age married children education,r
/*
Linear regression                               Number of obs     =      2,000
                                                F(4, 1995)        =     192.58
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2026
                                                Root MSE          =     .41992

------------------------------------------------------------------------------
             |               Robust
        work |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0102552   .0012236     8.38   0.000     .0078556    .0126548
     married |   .1111116   .0226719     4.90   0.000     .0666485    .1555748
    children |   .1153084   .0056978    20.24   0.000     .1041342    .1264827
   education |   .0186011   .0033006     5.64   0.000     .0121282     .025074
       _cons |  -.2073227   .0534581    -3.88   0.000    -.3121622   -.1024832
------------------------------------------------------------------------------
*/
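
* Added sketch (not part of the original do-file): count how many LPM fitted
* values fall outside [0,1], the problem noted for the linear probability model.
* The variable name yhat_lpm is an assumption chosen for illustration.
predict yhat_lpm, xb
count if (yhat_lpm < 0 | yhat_lpm > 1) & !missing(yhat_lpm)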

*----------------------------- logit regression -----------------------------------
logit work age married children education,nolog

/*
Logistic regression                             Number of obs     =      2,000
                                                LR chi2(4)        =     476.62
                                                Prob > chi2       =     0.0000
Log likelihood = -1027.9144                     Pseudo R2         =     0.1882

------------------------------------------------------------------------------
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0579303    .007221     8.02   0.000     .0437773    .0720833
     married |   .7417775   .1264705     5.87   0.000     .4938998    .9896552
    children |   .7644882   .0515289    14.84   0.000     .6634935     .865483
   education |   .0982513   .0186522     5.27   0.000     .0616936     .134809
       _cons |  -4.159247   .3320401   -12.53   0.000    -4.810034   -3.508461
------------------------------------------------------------------------------
*/
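
* Added sketch (not part of the original do-file): fitted logit probabilities
* lie strictly between 0 and 1 by construction; summarize shows their range.
* The variable name phat is an assumption chosen for illustration.
predict phat, pr
summarize phat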

* logit with robust standard errors
logit work age married children education,nolog r
/*
Logistic regression                             Number of obs     =      2,000
                                                Wald chi2(4)      =     344.54
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1027.9144               Pseudo R2         =     0.1882

------------------------------------------------------------------------------
             |               Robust
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0579303   .0072054     8.04   0.000     .0438079    .0720527
     married |   .7417775   .1272191     5.83   0.000     .4924326    .9911224
    children |   .7644882   .0497584    15.36   0.000     .6669635    .8620129
   education |   .0982513    .019011     5.17   0.000     .0609904    .1355121
       _cons |  -4.159247    .327398   -12.70   0.000    -4.800936   -3.517559
------------------------------------------------------------------------------
*/

* report odds ratios

logit work age married children education,nolog or

/*
Logistic regression                             Number of obs     =      2,000
                                                LR chi2(4)        =     476.62
                                                Prob > chi2       =     0.0000
Log likelihood = -1027.9144                     Pseudo R2         =     0.1882

------------------------------------------------------------------------------
        work | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.059641   .0076517     8.02   0.000      1.04475    1.074745
     married |   2.099664   .2655457     5.87   0.000     1.638694    2.690307
    children |   2.147895   .1106786    14.84   0.000     1.941563    2.376153
   education |    1.10324   .0205779     5.27   0.000     1.063636    1.144318
       _cons |   .0156193   .0051862   -12.53   0.000     .0081476     .029943
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
*/
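
* Added sketch (not part of the original do-file): the or option only changes
* what is displayed; _b[] still holds the logit coefficients. Exponentiating
* them should reproduce the Odds Ratio column above, e.g. exp(.0579303) for
* age against the reported 1.059641.
display exp(_b[age])
display exp(_b[married])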
*--------------------- marginal effects -----------------
* marginal effects at the sample means

margins,dydx(*) atmeans
/*
Conditional marginal effects                    Number of obs     =      2,000
Model VCE    : OIM

Expression   : Pr(work), predict()
dy/dx w.r.t. : age married children education
at           : age             =      36.208 (mean)
               married         =       .6705 (mean)
               children        =      1.6445 (mean)
               education       =      13.084 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0115031   .0014236     8.08   0.000     .0087129    .0142934
     married |   .1472934   .0248209     5.93   0.000     .0986453    .1959415
    children |    .151803   .0093768    16.19   0.000     .1334249    .1701812
   education |   .0195096   .0036991     5.27   0.000     .0122596    .0267596
------------------------------------------------------------------------------

*/
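
* Added sketch (not part of the original do-file): check the at-means marginal
* effect of age by hand as b_age * p * (1 - p), with p evaluated at the sample
* means printed by margins above. The scalar names are assumptions; because the
* printed means are rounded, the result should be close to, not identical to,
* the .0115031 reported by margins.
scalar xb_mean = _b[_cons] + _b[age]*36.208 + _b[married]*.6705 ///
    + _b[children]*1.6445 + _b[education]*13.084
scalar p_mean = invlogit(scalar(xb_mean))
display _b[age]*scalar(p_mean)*(1 - scalar(p_mean))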

*--------------------- marginal effects at specified covariate values (here: average marginal effects with age fixed at 30) -------------------
margins,dydx(*) at(age =30)
/*
Average marginal effects                        Number of obs     =      2,000
Model VCE    : OIM

Expression   : Pr(work), predict()
dy/dx w.r.t. : age married children education
at           : age             =          30

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .011179   .0014719     7.59   0.000      .008294    .0140639
     married |   .1431427   .0232525     6.16   0.000     .0975687    .1887167
    children |   .1475253   .0074033    19.93   0.000     .1330151    .1620355
   education |   .0189598   .0034727     5.46   0.000     .0121534    .0257662
------------------------------------------------------------------------------
*/
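
* Added sketch (not part of the original do-file): without atmeans or at(),
* margins averages the marginal effects over the observed sample.
margins, dydx(*)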
*------------------ percent correctly predicted ------------------
estat clas
/*
Logistic model for work

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      1177           361  |       1538
     -     |       166           296  |        462
-----------+--------------------------+-----------
   Total   |      1343           657  |       2000

Classified + if predicted Pr(D) >= .5
True D defined as work != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   87.64%
Specificity                     Pr( -|~D)   45.05%
Positive predictive value       Pr( D| +)   76.53%
Negative predictive value       Pr(~D| -)   64.07%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   54.95%
False - rate for true D         Pr( -| D)   12.36%
False + rate for classified +   Pr(~D| +)   23.47%
False - rate for classified -   Pr( D| -)   35.93%
--------------------------------------------------
Correctly classified                        73.65%
--------------------------------------------------
*/
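
* Added sketch (not part of the original do-file): the overall rate is
* (true positives + true negatives) / N from the table above,
* i.e. (1177 + 296)/2000 = .7365.
display (1177 + 296)/2000
* The 0.5 threshold is only the default. A different cutoff (0.6 here is an
* arbitrary illustration) or the ROC curve can be inspected with:
estat classification, cutoff(0.6)
lroc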

-END-

References

Chen Qiang (2014), Advanced Econometrics and Stata Applications (2nd ed.)
