This is one of Kaggle's Getting Started prediction competitions, a fairly simple contest aimed at newcomers. My best score reached roughly the top 8%. To revisit and consolidate this competition, I will split the write-up into three parts.
Competition page: Titanic: Machine Learning from Disaster
On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable", sank after colliding with an iceberg.
Unfortunately, there were not enough lifeboats on board for everyone, and 1502 of the 2224 passengers and crew died. While survival involved some luck, it seems some groups of people were more likely to survive than others.
In this challenge, we ask you to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.).
Task analysis: this is a classification task; we build a model to predict the survivors.
Feature engineering matters: the data and features set the upper bound on what machine learning can achieve, while models and algorithms merely approach that bound.
## Flag and merge train/test so features are engineered consistently on both
train_data['train'] = 1
test_data['train'] = 0
data_all = pd.concat([train_data, test_data], sort=True).reset_index(drop=True)
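The `train` flag makes it easy to split the combined frame back into train and test later. A minimal sketch with toy stand-in frames (the two-row/one-row data here is illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-ins for train_data / test_data (illustrative values only)
train_data = pd.DataFrame({'Fare': [7.25, 71.28], 'Survived': [0, 1]})
test_data = pd.DataFrame({'Fare': [8.05]})

train_data['train'] = 1
test_data['train'] = 0
data_all = pd.concat([train_data, test_data], sort=True).reset_index(drop=True)

# Splitting back is a boolean mask on the flag
train_part = data_all[data_all['train'] == 1].drop('train', axis=1)
test_part = data_all[data_all['train'] == 0].drop('train', axis=1)
print(len(train_part), len(test_part))  # 2 1
```

Note that `Survived` is NaN on the test rows after the concat, which is why its dtype becomes float64 in the combined frame.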
After analysing the continuous features, we can see that different segments of their distributions behave differently. To decide whether a binned feature is reasonable, we analyse them in more depth.
After splitting Fare into intervals, the per-interval survival rates clearly show that passengers with more expensive tickets had a better chance of surviving.
## Split survivors and non-survivors
survived = train_data['Survived'] == 1
max_fare = train_data['Fare'].max()
print('max_fare:', max_fare)
Fare_survived = train_data[survived]['Fare']
Fare_Notsurvived = train_data[~survived]['Fare']
## Survived: bin the data; the distribution is skewed, so different interval widths are used in different ranges
cut_list = list(range(0, 100, 10))
cut_list.extend(list(range(100, 610, 100)))
Fare_survived_cut = pd.cut(Fare_survived, cut_list, right=False)
Fare_survived_counts = Fare_survived_cut.value_counts(sort=False)
## Not survived: same bins
Fare_Notsurvived_cut = pd.cut(Fare_Notsurvived, cut_list, right=False)
Fare_Notsurvived_counts = Fare_Notsurvived_cut.value_counts(sort=False)
## Survived + not survived
Passenger_Fare_counts = Fare_survived_counts + Fare_Notsurvived_counts
RateOfFare_survived_counts = Fare_survived_counts / Passenger_Fare_counts
plt.figure(figsize=(10, 10))
plt.subplots_adjust(right=2, hspace=0.5)
plt.subplot(2, 2, 1)
Fare_survived_counts.plot.bar()
plt.title('Distribution of Survived in Fare interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 2)
Fare_Notsurvived_counts.plot.bar()
plt.title('Distribution of Unsurvived in Fare interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 3)
Passenger_Fare_counts.plot.bar()
plt.title('Distribution of All Passenger in Fare interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 4)
RateOfFare_survived_counts.plot.bar(color='r')
plt.title('Distribution of Rate_survived in Fare interval', size=15)
plt.ylabel('Rate Survived', size=15)
plt.show()
max_fare: 512.3292
After splitting Age into intervals, young children appear to have had a better chance of surviving.
## Split survivors and non-survivors
survived = train_data['Survived'] == 1
max_age = train_data['Age'].max()
print('max_age:', max_age)
Age_survived = train_data[survived]['Age']
Age_Notsurvived = train_data[~survived]['Age']
## Survived: minors get finer-grained bins
cut_list = [0, 4, 8, 12, 18, 30, 40, 55, 65, 80]
Age_survived_cut = pd.cut(Age_survived, cut_list, right=True)
Age_survived_counts = Age_survived_cut.value_counts(sort=False)
## Not survived
Age_Notsurvived_cut = pd.cut(Age_Notsurvived, cut_list, right=True)
Age_Notsurvived_counts = Age_Notsurvived_cut.value_counts(sort=False)
## Survived + not survived
Passenger_Age_counts = Age_survived_counts + Age_Notsurvived_counts
RateOfAge_survived_counts = Age_survived_counts / Passenger_Age_counts
plt.figure(figsize=(10, 10))
plt.subplots_adjust(right=2, hspace=0.5)
plt.subplot(2, 2, 1)
Age_survived_counts.plot.bar()
plt.title('Distribution of Survived in Age interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 2)
Age_Notsurvived_counts.plot.bar()
plt.title('Distribution of Unsurvived in Age interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 3)
Passenger_Age_counts.plot.bar()
plt.title('Distribution of All Passenger in Age interval', size=15)
plt.ylabel('Passenger Count', size=15)
plt.subplot(2, 2, 4)
RateOfAge_survived_counts.plot.bar(color='r')
plt.title('Distribution of Rate_survived in Age interval', size=15)
plt.ylabel('Rate Survived', size=15)
plt.show()
max_age: 80.0
Discretize the continuous features to obtain coarse-grained versions.
Bin Fare and Age.
Why discretize (bin) continuous data: binned features are more robust to outliers and noise, and they let a model capture the kind of per-interval differences we observed above.
cut_list = list(range(0,100,10))
cut_list.extend(list(range(100,700,100)))
Fare_cut=pd.cut(data_all['Fare'],cut_list,labels=False,right=False)
data_all['Fare_bin']=Fare_cut
cut_list = [0,4,8,12,18,30,40,55,65,80]
Age_cut=pd.cut(data_all['Age'],cut_list,labels=False)
data_all['Age_bin']=Age_cut
## Inspect the result
data_all[['Age','Age_bin','Fare','Fare_bin']].head()
| | Age | Age_bin | Fare | Fare_bin |
|---|---|---|---|---|
| 0 | 22.0 | 4 | 7.2500 | 0 |
| 1 | 38.0 | 5 | 71.2833 | 7 |
| 2 | 26.0 | 4 | 7.9250 | 0 |
| 3 | 35.0 | 5 | 53.1000 | 5 |
| 4 | 35.0 | 5 | 8.0500 | 0 |
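To sanity-check the table above, the two cut specifications can be replayed on a couple of the sample values (a quick verification sketch):

```python
import pandas as pd

fare_edges = list(range(0, 100, 10)) + list(range(100, 700, 100))
age_edges = [0, 4, 8, 12, 18, 30, 40, 55, 65, 80]

# Sample values taken from rows 0 and 1 of the table
fare_bin = pd.cut(pd.Series([7.25, 71.2833]), fare_edges, labels=False, right=False)
age_bin = pd.cut(pd.Series([22.0, 38.0]), age_edges, labels=False)  # right=True by default

print(fare_bin.tolist())  # [0, 7]: 71.2833 falls in [70, 80), the interval with index 7
print(age_bin.tolist())   # [4, 5]: 22 falls in (18, 30], index 4
```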
The survival-rate distributions of SibSp and Parch are roughly similar, so we can consider merging them into a single Family feature (SibSp + Parch + 1).
train_data['Family'] = train_data['SibSp'] + train_data['Parch'] + 1
sns.countplot(x='Family', hue='Survived', data=train_data)
plt.title('Count of Survival in Family Feature', size=15)
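Behind the countplot, the per-size survival rate is a one-line groupby. A sketch on toy rows (illustrative values, not the real data):

```python
import pandas as pd

# Toy rows with the same columns as train_data (values are made up)
df = pd.DataFrame({
    'SibSp':    [1, 1, 0, 1, 0, 3],
    'Parch':    [0, 0, 0, 0, 0, 1],
    'Survived': [0, 1, 1, 1, 0, 0],
})
df['Family'] = df['SibSp'] + df['Parch'] + 1

# Survival rate for each family size: the numeric view behind the plot
rate = df.groupby('Family')['Survived'].mean()
print(rate)
```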
Create the Family feature on the combined data as well, and group family sizes into coarse categories:
data_all['Family']=data_all['SibSp']+data_all['Parch']+1
Family_category_map = {1: 'Single', 2: 'Small', 3: 'Small', 4: 'Small', 5: 'Medium', 6: 'Medium', 7: 'Medium', 8: 'Large', 11: 'Large'}
data_all['Family_category'] = data_all['Family'].map(Family_category_map)
data_all[['Family_category','Family']].head()
| | Family_category | Family |
|---|---|---|
| 0 | Small | 2 |
| 1 | Small | 2 |
| 2 | Single | 1 |
| 3 | Small | 2 |
| 4 | Single | 1 |
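The explicit dict skips sizes 9 and 10 (they do not occur in this data, but mapping them would yield NaN). An equivalent interval-based grouping via pd.cut covers every size from 1 to 11; a sketch whose bin edges mirror the dict above:

```python
import pandas as pd

families = pd.Series([1, 2, 4, 5, 8, 11])

# (0,1] -> Single, (1,4] -> Small, (4,7] -> Medium, (7,11] -> Large
cat = pd.cut(families, bins=[0, 1, 4, 7, 11],
             labels=['Single', 'Small', 'Medium', 'Large'])
print(cat.tolist())  # ['Single', 'Small', 'Small', 'Medium', 'Large', 'Large']
```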
Name contains the passenger's title, which carries gender information and even marital status or social standing.
data_all['Title'] = data_all['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
Title_Dictionary = {
    "Capt": "Officer",
    "Col": "Officer",
    "Major": "Officer",
    "Jonkheer": "Royalty",
    "Don": "Royalty",
    "Dona": "Royalty",  # appears in the test set; without this entry it would map to NaN
    "Sir": "Royalty",
    "Dr": "Officer",
    "Rev": "Officer",
    "the Countess": "Royalty",
    "Mme": "Mrs",
    "Mlle": "Miss",
    "Ms": "Mrs",
    "Mr": "Mr",
    "Mrs": "Mrs",
    "Miss": "Miss",
    "Master": "Master",
    "Lady": "Royalty"
}
data_all['Title'] = data_all['Title'].map(Title_Dictionary)
data_all[['Name', 'Title']].head()
| | Name | Title |
|---|---|---|
| 0 | Braund, Mr. Owen Harris | Mr |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | Mrs |
| 2 | Heikkinen, Miss. Laina | Miss |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | Mrs |
| 4 | Allen, Mr. William Henry | Mr |
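The split-based extraction can also be written as a single regular expression that captures everything between the comma and the following period. A sketch on two of the names above:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
])

# Title = text between ', ' and the next '.'
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss']
```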
For Ticket, one ticket can be shared by several passengers, so the number of passengers on the same ticket encodes the prior knowledge that people travelled in groups. Cabin, having too many missing values, was simply dropped.
Ticket_Count = dict(data_all['Ticket'].value_counts())
data_all['TicketGroup'] = data_all['Ticket'].apply(lambda x:Ticket_Count[x])
sns.barplot(x='TicketGroup', y='Survived', data=data_all)
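The dict-plus-apply pattern above can be collapsed into a single map against value_counts, since mapping a Series by another Series looks values up by index. A small sketch with made-up ticket ids:

```python
import pandas as pd

tickets = pd.Series(['A1', 'A1', 'B2', 'C3', 'A1'])  # hypothetical ticket ids

# Each ticket maps to the number of passengers sharing it
ticket_group = tickets.map(tickets.value_counts())
print(ticket_group.tolist())  # [3, 3, 1, 1, 3]
```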
Apply one-hot encoding to the discrete features:
1. One-hot encoding expands the values of a discrete feature into Euclidean space: each value corresponds to a point in that space.
2. We map discrete features into Euclidean space because distance and similarity computations are central to regression, classification and clustering, and the commonly used measures (including cosine similarity) are defined in Euclidean space.
3. One-hot encoding therefore makes distance computations between discrete features more reasonable.
## Process categorical features; adapt the list to your own needs
## One-hot encode the discretized features and drop the original columns
def process_category_feature(data,category_feature=None):
for feature in category_feature:
# onehot
feature_dummies = pd.get_dummies(data[feature], prefix=feature)
data = pd.concat([data, feature_dummies],axis=1)
        ## drop the original column
data.drop(feature,axis=1,inplace=True)
return data
data_all=process_category_feature(data_all,category_feature=['Pclass','Embarked','Title','Family_category','Fare_bin','Age_bin','Sex'])
Drop the original features that are no longer needed:
data_all.drop(['Name','Ticket','SibSp','Parch'],axis=1,inplace=True)
# Standardize the continuous features
import sklearn.preprocessing as preprocessing
scaler = preprocessing.StandardScaler()
## Age
data_all['Age'] = scaler.fit_transform(data_all['Age'].values.reshape(-1, 1))
## Fare
data_all['Fare'] = scaler.fit_transform(data_all['Fare'].values.reshape(-1, 1))
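One caveat: fitting the scaler on the combined train+test frame lets test-set statistics leak into the transform. A stricter variant (a sketch with illustrative values, not the author's pipeline) fits on the training rows only and reuses those statistics for the test rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_fare = np.array([[7.25], [71.28], [8.05]])  # illustrative train fares
test_fare = np.array([[9.50]])                    # illustrative test fare

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_fare)  # mean/std from train only
test_scaled = scaler.transform(test_fare)        # same statistics reused

print(abs(train_scaled.mean()) < 1e-9)  # True: train is standardized to zero mean
```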
This completes the basic feature engineering; after a little extraction the data can be fed into a model.
Note that the dtype of Survived has become np.float64 here (because of the NaNs on the test rows); it will need a simple cast back to np.int8 later.
data_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 46 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Age                     1309 non-null   float64
 1   Fare                    1309 non-null   float64
 2   Survived                891 non-null    float64
 3   train                   1309 non-null   int64
 4   Family                  1309 non-null   int64
 5   TicketGroup             1309 non-null   int64
 6   Pclass_1                1309 non-null   uint8
 7   Pclass_2                1309 non-null   uint8
 8   Pclass_3                1309 non-null   uint8
 9   Embarked_0              1309 non-null   uint8
 10  Embarked_1              1309 non-null   uint8
 11  Embarked_2              1309 non-null   uint8
 12  Title_Master            1309 non-null   uint8
 13  Title_Miss              1309 non-null   uint8
 14  Title_Mr                1309 non-null   uint8
 15  Title_Mrs               1309 non-null   uint8
 16  Title_Officer           1309 non-null   uint8
 17  Title_Royalty           1309 non-null   uint8
 18  Family_category_Large   1309 non-null   uint8
 19  Family_category_Medium  1309 non-null   uint8
 20  Family_category_Single  1309 non-null   uint8
 21  Family_category_Small   1309 non-null   uint8
 22  Fare_bin_0              1309 non-null   uint8
 23  Fare_bin_1              1309 non-null   uint8
 24  Fare_bin_2              1309 non-null   uint8
 25  Fare_bin_3              1309 non-null   uint8
 26  Fare_bin_4              1309 non-null   uint8
 27  Fare_bin_5              1309 non-null   uint8
 28  Fare_bin_6              1309 non-null   uint8
 29  Fare_bin_7              1309 non-null   uint8
 30  Fare_bin_8              1309 non-null   uint8
 31  Fare_bin_9              1309 non-null   uint8
 32  Fare_bin_10             1309 non-null   uint8
 33  Fare_bin_11             1309 non-null   uint8
 34  Fare_bin_14             1309 non-null   uint8
 35  Age_bin_0               1309 non-null   uint8
 36  Age_bin_1               1309 non-null   uint8
 37  Age_bin_2               1309 non-null   uint8
 38  Age_bin_3               1309 non-null   uint8
 39  Age_bin_4               1309 non-null   uint8
 40  Age_bin_5               1309 non-null   uint8
 41  Age_bin_6               1309 non-null   uint8
 42  Age_bin_7               1309 non-null   uint8
 43  Age_bin_8               1309 non-null   uint8
 44  Sex_0                   1309 non-null   uint8
 45  Sex_1                   1309 non-null   uint8
dtypes: float64(3), int64(3), uint8(40)
memory usage: 112.6 KB
data_all.to_csv('./data_feature_engnieering.csv',index=False)
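When this file is loaded again in the modelling part, the split-back plus dtype fix might look like the following sketch (toy rows stand in for the saved CSV; this is not the author's exact loading code):

```python
import pandas as pd

# Toy combined frame with the same 'train' flag convention as above
data_all = pd.DataFrame({
    'Survived': [0.0, 1.0, float('nan')],
    'train':    [1, 1, 0],
    'Fare':     [-0.5, 0.3, 0.1],
})

train = data_all[data_all['train'] == 1].drop('train', axis=1)
test = data_all[data_all['train'] == 0].drop(['train', 'Survived'], axis=1)

# Restore the integer dtype that the concat turned into float64
y = train.pop('Survived').astype('int8')
X = train
print(y.tolist())  # [0, 1]
```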