The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. While some element of luck was involved in surviving, it seems that some groups of people were more likely to survive than others. In this challenge, you are asked to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (name, age, sex, socio-economic class, and so on).
import numpy as np               # scientific computing
import pandas as pd              # data analysis
import seaborn as sns            # data visualization
import matplotlib.pyplot as plt  # data visualization
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
PassengerId--passenger ID
Pclass-------passenger class (1st/2nd/3rd class)
Name---------passenger name
Sex----------sex
Age----------age
SibSp--------number of siblings/spouses aboard
Parch--------number of parents/children aboard
Ticket-------ticket number
Fare---------fare
Cabin--------cabin
Embarked-----port of embarkation
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
The most common data-cleaning technique is filling in missing values. As a rule of thumb, categorical columns are filled with the mode, while continuous columns can be filled with the median or the mean.
So here Embarked is filled with its mode and Age with its median. Columns with a large share of missing values (such as Cabin) are set aside for now. A continuous column can also be imputed by fitting a model to the data, as sketched below.
train['Embarked'] = train['Embarked'].fillna(train['Embarked'].mode()[0])  # categorical: mode
train['Age'] = train['Age'].fillna(train['Age'].median())                  # continuous: median
test['Age'] = test['Age'].fillna(train['Age'].median())                    # fill test with train statistics
test['Fare'] = test['Fare'].fillna(train['Fare'].mode()[0])
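As mentioned above, a continuous column such as Age can also be imputed by fitting a model on the rows where it is known, instead of using the median. A minimal sketch of that idea (illustrative only, not what the rest of this walkthrough does; feats, known, unknown and rfr are names introduced here):

from sklearn.ensemble import RandomForestRegressor
# Model-based imputation sketch: predict missing ages from a few complete numeric columns.
feats = ['Pclass', 'SibSp', 'Parch', 'Fare']
known = train[train['Age'].notnull()]
unknown = train[train['Age'].isnull()]
if len(unknown) > 0:  # no-op here, since Age was already filled with the median above
    rfr = RandomForestRegressor(n_estimators=100, random_state=0)
    rfr.fit(known[feats], known['Age'])
    train.loc[train['Age'].isnull(), 'Age'] = rfr.predict(unknown[feats])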
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            418 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           418 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
plt.hist(x = [train[train['Survived']==1]['Age'], train[train['Survived']==0]['Age']],
stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
[Figure: Age Histogram by Survival]
From the chart above:
1. Elderly passengers had the highest death rate.
2. Young adults account for most of the deaths, simply because they make up the largest share of passengers.
3. Young adults likewise account for most of the survivors.
4. Children aged 0-10 had the highest survival rate.
(These rates can be double-checked with the short sketch below.)
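A small sketch for that check; the bin edges here are chosen for illustration:

age_bins = pd.cut(train['Age'], bins=[0, 10, 18, 40, 60, 100])
# survival rate and passenger count per age band
print(train.groupby(age_bins)['Survived'].agg(['mean', 'count']))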
plt.hist(x = [train[train['Survived']==1]['Pclass'], train[train['Survived']==0]['Pclass']],
stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.xticks([1,2,3])
plt.title('Pclass Histogram by Survival')
plt.xlabel('Pclass ')
plt.ylabel('# of Passengers')
plt.legend()
[Figure: Pclass Histogram by Survival]
From the chart above:
1. Third-class passengers had the highest death rate and the most deaths, and also made up the largest group of passengers.
2. First-class passengers had the lowest death rate and the highest survival rate, with the most survivors.
(A quick crosstab check follows below.)
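A sketch of that crosstab:

# share of deaths (0) and survivals (1) within each passenger class
print(pd.crosstab(train['Pclass'], train['Survived'], normalize='index'))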
train['family_size']=train['SibSp']+train['Parch']+1
train.family_size.unique()
y=train[['family_size', 'Survived']].groupby(['family_size'],as_index=False).sum()
y.plot.bar(x='family_size',rot=45)
[Figure: bar chart of survivor counts by family_size]
plt.hist(x = [train[train['Survived']==1]['family_size'], train[train['Survived']==0]['family_size']],
stacked=True, color = ['b','r'],label = ['Survived','Dead'])
# plt.xticks([1, 2, 3, 4, 5, 6, 7, 8, 11])
plt.title('family_size Histogram by Survival')
plt.xlabel('family_size ')
plt.ylabel('# of Passengers')
plt.legend()
[Figure: family_size Histogram by Survival]
From the chart above:
1. Passengers travelling alone account for the most deaths, the largest passenger count, and the most survivors.
2. Families with more than four members had the highest death rate and a small chance of survival.
3. Families of four had the highest survival rate.
(See the groupby check below.)
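The family-size rates can be verified the same way (sketch):

# survival rate and passenger count per family size
print(train.groupby('family_size')['Survived'].agg(['mean', 'count']))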
plt.hist(x = [train[train['Survived']==1]['Embarked'], train[train['Survived']==0]['Embarked']],
stacked=True, color = ['b','r'],label = ['Survived','Dead'])
plt.title('Embarked Histogram by Survival')
plt.xlabel('Embarked')
plt.ylabel('# of Passengers')
plt.legend()
[Figure: Embarked Histogram by Survival]
train = train.drop(['family_size'], axis=1)  # drop the temporary column used for plotting
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
train.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
full = pd.concat([train, test], ignore_index=True, sort=False)  # stack train and test into one frame
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       891 non-null float64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1309 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
Split the Cabin column into two categories: has a value ("Yes") vs. missing ("No").
def set_Cabin_type(df):
df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
return df
full = set_Cabin_type(full)
full.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | No | S |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | No | S |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | Yes | S |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | No | S |
full=full.drop(['Ticket','PassengerId','Name'],axis=1)
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
Survived 891 non-null float64
Pclass 1309 non-null int64
Sex 1309 non-null object
Age 1309 non-null float64
SibSp 1309 non-null int64
Parch 1309 non-null int64
Fare 1309 non-null float64
Cabin 1309 non-null object
Embarked 1309 non-null object
dtypes: float64(3), int64(3), object(3)
memory usage: 92.1+ KB
Encode Sex as 0/1, and apply the same mapping trick to the Yes/No Cabin flag:
set_map={'male':1,
'female':0}
full['Sex']=full['Sex'].map(set_map)
set_map={'Yes':1,
'No':0}
full['Cabin']=full['Cabin'].map(set_map)
full.head()
 | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | S |
1 | 1.0 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 | C |
2 | 1.0 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 | S |
3 | 1.0 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | 1 | S |
4 | 0.0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | S |
pclass=pd.DataFrame()
pclass=pd.get_dummies(full['Pclass'],prefix='Pclass')
pclass.head()
 | Pclass_1 | Pclass_2 | Pclass_3 |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 1 | 0 | 0 |
4 | 0 | 0 | 1 |
full=pd.concat([full,pclass],axis=1)
full=full.drop(['Pclass'],axis=1)
embarked=pd.DataFrame()
embarked=pd.get_dummies(full['Embarked'],prefix='Embarked')
embarked.head()
 | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|
0 | 0 | 0 | 1 |
1 | 1 | 0 | 0 |
2 | 0 | 0 | 1 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
full=pd.concat([full,embarked],axis=1)
full=full.drop(['Embarked'],axis=1)
full.head()
 | Survived | Sex | Age | SibSp | Parch | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1 | 1.0 | 0 | 38.0 | 1 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 1.0 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
3 | 1.0 | 0 | 35.0 | 1 | 0 | 53.1000 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 0.0 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
family=pd.DataFrame()
family['family_size']=full['SibSp']+full['Parch']+1
family['family_sigle']=family['family_size'].map(lambda s: 1 if s==1 else 0)
family['family_small']=family['family_size'].map(lambda s:1 if 2<=s<=4 else 0)
family['family_large']=family['family_size'].map(lambda s:1 if s>=5 else 0)
family.head()
 | family_size | family_sigle | family_small | family_large |
---|---|---|---|---|
0 | 2 | 0 | 1 | 0 |
1 | 2 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 |
3 | 2 | 0 | 1 | 0 |
4 | 1 | 1 | 0 | 0 |
full=pd.concat([full,family],axis=1)
full=full.drop(['SibSp','Parch','family_size'],axis=1)
full.head()
 | Survived | Sex | Age | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | family_sigle | family_small | family_large |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1 | 22.0 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1.0 | 0 | 38.0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 1.0 | 0 | 26.0 | 7.9250 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
3 | 1.0 | 0 | 35.0 | 53.1000 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 0.0 | 1 | 35.0 | 8.0500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
age=pd.DataFrame()
age['child']=full['Age'].map(lambda s:1 if 0<s<=6 else 0)
age['teen']=full['Age'].map(lambda s:1 if 6<s<=18 else 0)
age['younth']=full['Age'].map(lambda s:1 if 18<s<=40 else 0)
age['mid']=full['Age'].map(lambda s:1 if 40<s<=60 else 0)
age['old']=full['Age'].map(lambda s:1 if s>60 else 0)
age.head()
 | child | teen | younth | mid | old |
---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 |
full=pd.concat([full,age],axis=1)
full=full.drop(['Age'],axis=1)
full.head()
 | Survived | Sex | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | family_sigle | family_small | family_large | child | teen | younth | mid | old |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1 | 7.2500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 1.0 | 0 | 71.2833 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1.0 | 0 | 7.9250 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 1.0 | 0 | 53.1000 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.0 | 1 | 8.0500 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
full['Fare'] = scaler.fit_transform(full['Fare'].values.reshape(-1, 1))  # standardize Fare to zero mean, unit variance
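Side note: the cell above fits the scaler on the combined train+test frame. If you wanted to keep test-set statistics out of the transform, an alternative sketch (relying on the fact that rows 0-890 of full are the training set) fits on the training rows only:

# alternative: fit the scaler on the training rows only, then transform everything
scaler = preprocessing.StandardScaler()
scaler.fit(full.loc[:890, 'Fare'].values.reshape(-1, 1))
full['Fare'] = scaler.transform(full['Fare'].values.reshape(-1, 1))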
full.head()
 | Survived | Sex | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | family_sigle | family_small | family_large | child | teen | younth | mid | old |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 1 | -0.503176 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 1.0 | 0 | 0.734809 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 1.0 | 0 | -0.490126 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 1.0 | 0 | 0.383263 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0.0 | 1 | -0.487709 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
train=full.loc[:890]
test_=full.loc[891:]
x_train=train.drop(['Survived'],axis=1)
x_train.head()
 | Sex | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | family_sigle | family_small | family_large | child | teen | younth | mid | old |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | -0.503176 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0.734809 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | -0.490126 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0.383263 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 1 | -0.487709 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
y_train=train['Survived'].astype(int)
y_train.head()
0 0
1 1
2 1
3 1
4 0
Name: Survived, dtype: int32
test_=test_.drop(['Survived'],axis=1)
test_.head()
 | Sex | Fare | Cabin | Pclass_1 | Pclass_2 | Pclass_3 | Embarked_C | Embarked_Q | Embarked_S | family_sigle | family_small | family_large | child | teen | younth | mid | old |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
891 | 1 | -0.491978 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
892 | 0 | -0.508010 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
893 | 1 | -0.456051 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
894 | 1 | -0.475868 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
895 | 0 | -0.405784 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
from sklearn.model_selection import train_test_split   # data splitting utilities
from sklearn.linear_model import LogisticRegression    # linear model: logistic regression
from sklearn.tree import DecisionTreeClassifier        # tree model: decision tree
from sklearn.ensemble import RandomForestClassifier    # ensemble model: random forest (RF)
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score    # cross-validation scoring
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, mean_squared_error, classification_report  # model evaluation metrics
# train/validation split
t1_x, t2_x, t1_y, t2_y = train_test_split(x_train, y_train, test_size=0.3, random_state=11)
# candidate models
models = [LogisticRegression(), DecisionTreeClassifier(), RandomForestClassifier(),
          XGBClassifier(), LGBMClassifier(), KNeighborsClassifier(), SVC()]
# evaluate models by using cross-validation
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
score=cross_val_score(model,t1_x,t1_y,cv=5)
print("{}:{},{}".format(name,score.mean(),score))
LR:0.7960387096774193,[0.768 0.832 0.856 0.80645161 0.71774194]
Tree:0.8057419354838709,[0.776 0.808 0.856 0.83064516 0.75806452]
RF:0.7993032258064515,[0.752 0.832 0.848 0.80645161 0.75806452]
XGBC:0.8137290322580645,[0.8 0.832 0.856 0.80645161 0.77419355]
LGBC:0.8217806451612903,[0.808 0.824 0.864 0.80645161 0.80645161]
KNN:0.7799870967741935,[0.76 0.816 0.832 0.7983871 0.69354839]
SVM:0.8073161290322581,[0.792 0.84 0.832 0.81451613 0.75806452]
# evaluate models on the 30% hold-out split, again via cross-validation
names=['LR','Tree','RF','XGBC','LGBC','KNN','SVM']
for name, model in zip(names,models):
score=cross_val_score(model,t2_x,t2_y,cv=5)
print("{}:{},{}".format(name,score.mean(),score))
LR:0.8354996505939901,[0.83333333 0.90740741 0.85185185 0.83018868 0.75471698]
Tree:0.7908455625436759,[0.7962963 0.85185185 0.77777778 0.73584906 0.79245283]
RF:0.7986722571628231,[0.77777778 0.7962963 0.77777778 0.79245283 0.8490566 ]
XGBC:0.7986023759608665,[0.81481481 0.81481481 0.74074074 0.77358491 0.8490566 ]
LGBC:0.8245981830887491,[0.77777778 0.90740741 0.7962963 0.81132075 0.83018868]
KNN:0.801956673654787,[0.83333333 0.85185185 0.7962963 0.71698113 0.81132075]
SVM:0.8579315164220824,[0.85185185 0.92592593 0.87037037 0.81132075 0.83018868]
from sklearn.ensemble import VotingClassifier
LR = LogisticRegression()
Tree = DecisionTreeClassifier()
RF = RandomForestClassifier()
XGBC = XGBClassifier()
LGBC = LGBMClassifier()
KNN = KNeighborsClassifier()
SVM = SVC()
eclf=VotingClassifier([('LR',LR),('Tree',Tree),('RF',RF),('XGBC',XGBC),('LGBC',LGBC),
('KNN',KNN),('SVM',SVM)],voting='hard',n_jobs=-1)
eclf.fit(t1_x,t1_y)
eclf.score(t2_x,t2_y)
0.8582089552238806
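With voting='hard' the ensemble takes a simple majority vote over the predicted labels. A soft-voting variant averages the predicted class probabilities instead; a sketch of how that might look (SVC needs probability=True to expose predict_proba; no score is reported here):

# soft voting: average class probabilities instead of counting label votes
eclf_soft = VotingClassifier([('LR', LogisticRegression()), ('Tree', DecisionTreeClassifier()),
                              ('RF', RandomForestClassifier()), ('XGBC', XGBClassifier()),
                              ('LGBC', LGBMClassifier()), ('KNN', KNeighborsClassifier()),
                              ('SVM', SVC(probability=True))],
                             voting='soft', n_jobs=-1)
eclf_soft.fit(t1_x, t1_y)
print(eclf_soft.score(t2_x, t2_y))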
result=eclf.predict(test_)
submission = pd.DataFrame({
"PassengerId": test["PassengerId"],
"Survived": result
})
submission.to_csv('submission.csv', index=False)