
Titanic Survival Prediction with Linear Regression, Logistic Regression, Decision Trees, and Random Forests

Titanic Survival Prediction

from IPython.display import Image
Image(filename=r'C:\Users\a\Desktop\暑假\Titantic\QQ截图20190827081938.png',width=800)


Step 1: Data Analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the file and look at the data description
data = pd.read_csv('titanic_train.csv')
data.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

We can see that the Age, Cabin, and Embarked columns have missing values. Cabin is missing too many values to be useful, so we drop it outright, and Ticket has no obvious bearing on whether a passenger was rescued, so we drop it as well.
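
A quick null count over every column makes this concrete (a small check added here; describe() above only summarizes numeric columns, so Cabin and Embarked do not appear in it):

# Count missing values per column: Age and Embarked are incomplete, and Cabin is empty for most rows
data.isnull().sum()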

data.head()
   PassengerId  Survived  Pclass                                                Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                             Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                              Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1        Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                            Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S

Background reading shows that the evacuation policy after the Titanic struck the iceberg was "women and children first". Checking the survival rates by sex below, the female mean survival rate of 0.74 is far above the male rate of 0.19.

# Background reading: the evacuation policy was "women and children first", so check survival rates by sex
data.pivot_table(index='Sex',values='Survived')  # the female mean survival rate, 0.74, is far above the male 0.19
        Survived
Sex             
female  0.742038
male    0.188908

Check the mean age of survivors versus non-survivors; by itself this does not tell us much.

# Mean age by survival outcome; by itself this does not tell us much
data.pivot_table(index='Survived',values='Age')
                Age
Survived           
0         30.626179
1         28.343690

Check the mean age of female passengers by outcome; again, not very informative on its own.

# Mean age of female passengers by outcome; again not very informative on its own
data[data['Sex']=='female'].pivot_table(index='Survived',values='Age')
                Age
Survived           
0         25.046875
1         28.847716

Contemporary sources define a child as anyone under 14. Checking survival rates as the age cutoff grows, the rate falls as age rises (most sharply for males), which again corroborates the women-and-children-first policy.

# A child was defined at the time as under 14; the survival rate falls as the age cutoff rises,
# which supports the women-and-children-first policy, so we treat age as an important feature
for i in np.arange(20):
    print(data[data['Age']<= i].pivot_table(index = 'Sex',values='Survived'))
Output (condensed): survival rate by sex among passengers with Age <= i; the i = 0 slice selects no rows (empty DataFrame).

 i    female      male
 1  1.000000  0.800000
 2  0.600000  0.642857
 3  0.583333  0.722222
 4  0.705882  0.652174
 5  0.761905  0.652174
 6  0.739130  0.666667
 7  0.750000  0.615385
 8  0.730769  0.607143
 9  0.633333  0.593750
10  0.612903  0.575758
11  0.593750  0.555556
12  0.593750  0.567568
13  0.617647  0.567568
14  0.631579  0.538462
15  0.651163  0.525000
16  0.673469  0.431373
17  0.690909  0.396552
18  0.676471  0.338028
19  0.706667  0.292135
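
To see the trend at a glance, here is a small plotting sketch (an addition, not part of the original notebook) that draws the overall cumulative survival rate against the age cutoff, reusing the matplotlib import from the top:

# Cumulative survival rate among passengers with Age <= i, for i = 1..19
ages = np.arange(1, 20)
rates = [data[data['Age'] <= i]['Survived'].mean() for i in ages]
plt.plot(ages, rates, marker='o')
plt.xlabel('age cutoff i')
plt.ylabel('survival rate for Age <= i')
plt.show()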

Society at the time had a rigid class hierarchy. Checking the survival rates for the three passenger classes, they differ sharply, so Pclass is also an important feature.

# Society then had a rigid class hierarchy; check the survival rate for each of the three classes.
# The rates differ sharply across classes, so Pclass is also an important feature
data.pivot_table(index='Pclass',values='Survived')
        Survived
Pclass          
1       0.629630
2       0.472826
3       0.242363

Would the port of embarkation affect the survival rate? The rates do differ noticeably by port; passengers boarding at different ports may have been berthed in different parts of the ship, at varying distances from the lifeboats.

# Does the port of embarkation affect the survival rate?
data.pivot_table(index='Embarked',values='Survived')
# The rates differ noticeably by port; different ports may correspond to berths closer to or farther from the boats
          Survived
Embarked          
C         0.553571
Q         0.389610
S         0.336957

Would the number of siblings and spouses aboard affect the survival rate? Beyond one, the rate drops as the count rises.

# Does the number of siblings/spouses aboard affect survival? Beyond one, the rate drops as the count rises
data.pivot_table(index='SibSp',values='Survived')
       Survived
SibSp          
0      0.345395
1      0.535885
2      0.464286
3      0.250000
4      0.166667
5      0.000000
8      0.000000

Would the number of parents and children aboard affect the survival rate? Passengers traveling with one to three parents or children generally fared better than those traveling with none.

# Does the number of parents/children aboard affect survival? Traveling with one to three generally helped
data.pivot_table(index='Parch',values='Survived')
       Survived
Parch          
0      0.343658
1      0.550847
2      0.500000
3      0.600000
4      0.000000
5      0.200000
6      0.000000

At this point we take the important features to be Pclass, Sex, Age, Embarked, SibSp, and Parch.

Build a new data table.

# Important features: Pclass, Sex, Age, SibSp, Parch, Embarked
# Build a new data table; copy() avoids SettingWithCopyWarning when we fill values in place later
columns = ['Pclass','Sex','Age','SibSp','Parch','Embarked','Survived','Fare']
new_data = data[columns].copy()
new_data.head()
   Pclass     Sex   Age  SibSp  Parch Embarked  Survived     Fare
0       3    male  22.0      1      0        S         0   7.2500
1       1  female  38.0      1      0        C         1  71.2833
2       3  female  26.0      0      0        S         1   7.9250
3       1  female  35.0      1      0        S         1  53.1000
4       3    male  35.0      0      0        S         0   8.0500

Check for missing values.

# Check for missing values
new_data.isnull().sum()
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Embarked      2
Survived      0
Fare          0
dtype: int64

Fill the missing values: Age with the median, Embarked with the mode.

# Fill missing values: Age with the median, Embarked with the mode
new_data['Age'] = new_data['Age'].fillna(new_data['Age'].median())
print(new_data['Age'].median())
print(new_data['Embarked'].mode())
# View the data description again
new_data.describe()
28.0
0    S
dtype: object
           Pclass         Age       SibSp       Parch    Survived        Fare
count  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean     2.308642   29.361582    0.523008    0.381594    0.383838   32.204208
std      0.836071   13.019697    1.102743    0.806057    0.486592   49.693429
min      1.000000    0.420000    0.000000    0.000000    0.000000    0.000000
25%      2.000000   22.000000    0.000000    0.000000    0.000000    7.910400
50%      3.000000   28.000000    0.000000    0.000000    0.000000   14.454200
75%      3.000000   35.000000    1.000000    0.000000    1.000000   31.000000
max      3.000000   80.000000    8.000000    6.000000    1.000000  512.329200

Check for nulls again.

new_data.isnull().sum()
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Embarked    2
Survived    0
Fare        0
dtype: int64

Fill the embarkation port.

# Fill the remaining Embarked values with the mode, 'S'
new_data["Embarked"] = new_data["Embarked"].fillna('S')
new_data.isnull().sum()
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Embarked    0
Survived    0
Fare        0
dtype: int64

Convert the Sex strings to numbers: male = 0, female = 1.

# Encode Sex numerically: male -> 0, female -> 1
new_data.loc[new_data['Sex']=='male','Sex'] = 0
new_data.loc[new_data['Sex']=='female','Sex'] = 1

Convert the Embarked strings to numbers: C = 0, Q = 1, S = 2.

# Encode Embarked numerically: C -> 0, Q -> 1, S -> 2
new_data.loc[new_data['Embarked']=='C','Embarked'] = 0
new_data.loc[new_data['Embarked']=='Q','Embarked'] = 1
new_data.loc[new_data['Embarked']=='S','Embarked'] = 2
new_data.head()
   Pclass  Sex   Age  SibSp  Parch  Embarked  Survived     Fare
0       3    0  22.0      1      0         2         0   7.2500
1       1    1  38.0      1      0         0         1  71.2833
2       3    1  26.0      0      0         2         1   7.9250
3       1    1  35.0      1      0         2         1  53.1000
4       3    0  35.0      0      0         2         0   8.0500
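
As a side note, the same encoding can be written more compactly with pandas' map. This is an equivalent sketch, to be run instead of (not after) the .loc assignments above, since re-mapping already-numeric values would produce NaN:

# Equivalent one-liners: map each category string to its numeric code
new_data['Sex'] = new_data['Sex'].map({'male': 0, 'female': 1})
new_data['Embarked'] = new_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})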

Step 2: Initial Modeling and Parameter Tuning

Linear regression

# Start modeling with linear regression
from sklearn.linear_model import LinearRegression
# Use cross-validation
from sklearn.model_selection import KFold,cross_val_score
kf = KFold(n_splits=5)  # random_state is ignored unless shuffle=True, and newer sklearn rejects it
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
train_target = new_data['Survived']
LR = LinearRegression()
accuracys=[]
for train,test in kf.split(new_data):
    LR.fit(new_data.loc[train,predictors],new_data.loc[train,'Survived'])
    pred = LR.predict(new_data.loc[test,predictors])
    # Threshold the regression output at 0.60 to turn it into a 0/1 label
    pred[pred >= 0.60] = 1
    pred[pred < 0.60] = 0
    accuracy = len(pred[pred == new_data.loc[test,'Survived']])/len(test)
    accuracys.append(accuracy)
print(np.mean(accuracys))
0.8035653756826313

Logistic regression

# Start modeling with logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold,cross_val_score
kf = KFold(n_splits=5)  # as above, no random_state without shuffle=True
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
lr = LogisticRegression(C = 0.1,solver='liblinear',penalty='l2')
lr.fit(new_data[predictors],new_data['Survived'])
print(cross_val_score(lr,new_data[predictors],new_data['Survived'],cv = kf).mean())
accuracys = []
for train,test in kf.split(new_data):
    lr.fit(new_data.loc[train,predictors],new_data.loc[train,'Survived'])
    pred = lr.predict_proba(new_data.loc[test,predictors])
    new_pred = pred[:,1]  # column 1 holds the predicted probability of survival
    new_pred[new_pred >= 0.50] = 1
    new_pred[new_pred < 0.50] = 0
    accuracy = len(new_pred[new_pred == new_data.loc[test,'Survived']])/len(test)
    accuracys.append(accuracy)
print(np.mean(accuracys))
0.7956939300734418
0.7956939300734418
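
The regularization strength C = 0.1 above was picked by hand. As a sketch of how it could be tuned instead (an addition using sklearn's GridSearchCV, not part of the original), here is a small grid search over C with the same 5-fold split:

from sklearn.model_selection import GridSearchCV

# Search a few values of the inverse regularization strength C
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(solver='liblinear', penalty='l2'),
                    param_grid, cv=kf, scoring='accuracy')
grid.fit(new_data[predictors], new_data['Survived'])
print(grid.best_params_, grid.best_score_)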

Decision tree

# Start modeling with a decision tree
from sklearn import tree
dt = tree.DecisionTreeClassifier(min_samples_split=4, min_samples_leaf=4)
kf = KFold(n_splits=5)
accuracys = []
for train,test in kf.split(new_data):
    dt.fit(new_data.loc[train,predictors],new_data.loc[train,'Survived'])
    pred = dt.predict(new_data.loc[test,predictors])
    accuracy = len(pred[pred == new_data.loc[test,'Survived']])/len(test)
    accuracys.append(accuracy)
print(np.mean(accuracys))
print(cross_val_score(dt,new_data[predictors],new_data['Survived'],cv=kf).mean())
0.804758018956751
0.8036344234511331

Step 3: Ensemble Learning with a Random Forest

# Start modeling with a random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
alg = RandomForestClassifier(random_state=1, n_estimators=80, min_samples_split=4, min_samples_leaf=4)
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Embarked"]
kf = KFold(n_splits=5)
scores = cross_val_score(alg, new_data[predictors], new_data["Survived"], cv=kf)  # method 1: cross_val_score
print(scores.mean())
accuracys = []   # method 2: manual fold loop
for train,test in kf.split(new_data):
    alg.fit(new_data.loc[train,predictors],new_data.loc[train,'Survived'])
    pred = alg.predict(new_data.loc[test,predictors])
    accuracy = len(pred[pred == new_data.loc[test,'Survived']])/len(test)
    accuracys.append(accuracy)
print(np.mean(accuracys))
0.820475801895675
0.820475801895675
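
As a final check (an addition, not in the original), the fitted forest exposes feature_importances_; after the loop above, alg holds the model trained on the last fold, so we can rank the six predictors:

# Rank predictors by the forest's impurity-based feature importance
for name, score in sorted(zip(predictors, alg.feature_importances_), key=lambda x: -x[1]):
    print('%s: %.3f' % (name, score))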