[Financial Risk Control: Loan Default Prediction] Data Mining Study Notes: 3. Feature Engineering

Table of Contents

Learning Goals

Content Overview

Code Examples

Import Packages and Read the Data

Feature Preprocessing

Filling Missing Values

Handling Time Formats

Converting Object-Type Features to Numeric

Handling Categorical Features

Outlier Handling

Outlier Detection Method 1: The 3σ Rule

Outlier Detection Method 2: Box Plots

Data Binning

Fixed-Width Binning

Quantile Binning

Feature Interactions

Feature Encoding

Label Encoding for Direct Use in Tree Models

Extra Feature Engineering for Logistic Regression and Similar Models

Feature Selection

Filter

Wrapper (Recursive Feature Elimination, RFE)

Embedded

Summary


Learning Goals

  • Learn feature-processing methods such as feature preprocessing, missing-value and outlier handling, and data binning;
  • Learn the corresponding methods for feature interaction, encoding, and selection.

Content Overview

  • Data preprocessing
    • Filling missing values
    • Handling time formats
    • Converting object-type features to numeric
  • Outlier handling
    • Based on the 3σ (three-sigma) rule
    • Based on box plots
  • Data binning
    • Fixed-width binning
    • Quantile binning
      • Binning discrete numerical data
      • Binning continuous numerical data
    • Chi-square binning
  • Feature interactions
    • Combinations of features
    • Features derived from other features
    • Other feature-derivation experiments
  • Feature encoding
    • One-hot encoding
    • Label encoding
  • Feature selection
    • 1 Filter
    • 2 Wrapper (RFE)
    • 3 Embedded

Code Examples

Import Packages and Read the Data

  import pandas as pd
  import numpy as np
  import matplotlib.pyplot as plt
  import seaborn as sns
  import datetime
  from tqdm import tqdm
  from sklearn.preprocessing import LabelEncoder
  from sklearn.feature_selection import SelectKBest
  from sklearn.feature_selection import chi2
  from sklearn.preprocessing import MinMaxScaler
  import xgboost as xgb
  import lightgbm as lgb
  from catboost import CatBoostRegressor
  import warnings
  from sklearn.model_selection import StratifiedKFold, KFold
  from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
  warnings.filterwarnings('ignore')

  data_train = pd.read_csv('../train.csv')
  data_test_a = pd.read_csv('../testA.csv')

Feature Preprocessing

  • The EDA stage already gave us an overview of the data and of certain feature distributions. Data preprocessing generally deals with the problems surfaced during EDA; this section covers filling missing values, converting time-format features, and handling certain object-type categorical features.
  • First, identify the categorical and numerical features in the data:

  numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)  # numerical features
  category_fea = list(filter(lambda x: x not in numerical_fea, list(data_train.columns)))  # categorical features
  label = 'isDefault'  # target column
  numerical_fea.remove(label)

  • Data preprocessing is an indispensable part of any competition. How missing values are filled often affects the result, so it is worth trying several filling strategies and keeping whichever scores best. Competition data is relatively "clean" compared with real-world data, but some "dirty" records remain, and cleaning up outliers can bring unexpectedly large gains.

Filling Missing Values

  • Replace all missing values with a specified value such as 0:

  data_train = data_train.fillna(0)

  • Fill each missing value with the value above it (forward fill):

  data_train = data_train.fillna(axis=0, method='ffill')

  • Fill each missing value vertically with the value below it (backward fill), filling at most two consecutive missing values:

  data_train = data_train.fillna(axis=0, method='bfill', limit=2)

  • Check the missing-value counts:
  data_train.isnull().sum()
  # id 0
  # loanAmnt 0
  # term 0
  # interestRate 0
  # installment 0
  # grade 0
  # subGrade 0
  # employmentTitle 1
  # employmentLength 46799
  # homeOwnership 0
  # annualIncome 0
  # verificationStatus 0
  # issueDate 0
  # isDefault 0
  # purpose 0
  # postCode 1
  # regionCode 0
  # dti 239
  # delinquency_2years 0
  # ficoRangeLow 0
  # ficoRangeHigh 0
  # openAcc 0
  # pubRec 0
  # pubRecBankruptcies 405
  # revolBal 0
  # revolUtil 531
  # totalAcc 0
  # initialListStatus 0
  # applicationType 0
  # earliesCreditLine 0
  # title 1
  # policyCode 0
  # n0 40270
  # n1 40270
  # n2 40270
  # n2.1 40270
  # n4 33239
  # n5 40270
  # n6 40270
  # n7 40270
  # n8 40271
  # n9 40270
  # n10 33239
  # n11 69752
  # n12 40270
  # n13 40270
  # n14 40270
  # dtype: int64

  # Fill numerical features with the median (computed on the training set)
  data_train[numerical_fea] = data_train[numerical_fea].fillna(data_train[numerical_fea].median())
  data_test_a[numerical_fea] = data_test_a[numerical_fea].fillna(data_train[numerical_fea].median())
  # Fill categorical features with the mode.
  # Caveat: fillna with the DataFrame returned by .mode() aligns on the index, so in
  # effect only row 0 gets filled; use .mode().iloc[0] to fill every row. This is why
  # employmentLength still shows 46799 missing values below.
  data_train[category_fea] = data_train[category_fea].fillna(data_train[category_fea].mode())
  data_test_a[category_fea] = data_test_a[category_fea].fillna(data_train[category_fea].mode())

  data_train.isnull().sum()
  # id 0
  # loanAmnt 0
  # term 0
  # interestRate 0
  # installment 0
  # grade 0
  # subGrade 0
  # employmentTitle 0
  # employmentLength 46799
  # homeOwnership 0
  # annualIncome 0
  # verificationStatus 0
  # issueDate 0
  # isDefault 0
  # purpose 0
  # postCode 0
  # regionCode 0
  # dti 0
  # delinquency_2years 0
  # ficoRangeLow 0
  # ficoRangeHigh 0
  # openAcc 0
  # pubRec 0
  # pubRecBankruptcies 0
  # revolBal 0
  # revolUtil 0
  # totalAcc 0
  # initialListStatus 0
  # applicationType 0
  # earliesCreditLine 0
  # title 0
  # policyCode 0
  # n0 0
  # n1 0
  # n2 0
  # n2.1 0
  # n4 0
  # n5 0
  # n6 0
  # n7 0
  # n8 0
  # n9 0
  # n10 0
  # n11 0
  # n12 0
  # n13 0
  # n14 0
  # dtype: int64

Handling Time Formats

  # Convert to datetime format
  for data in [data_train, data_test_a]:
      data['issueDate'] = pd.to_datetime(data['issueDate'], format='%Y-%m-%d')
      startdate = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
      # Construct a time feature: days elapsed since the earliest issue date
      data['issueDateDT'] = data['issueDate'].apply(lambda x: x - startdate).dt.days

  data_train['employmentLength'].value_counts(dropna=False).sort_index()
  # 1 year 52489
  # 10+ years 262753
  # 2 years 72358
  # 3 years 64152
  # 4 years 47985
  # 5 years 50102
  # 6 years 37254
  # 7 years 35407
  # 8 years 36192
  # 9 years 30272
  # < 1 year 64237
  # NaN 46799
  # Name: employmentLength, dtype: int64

Converting Object-Type Features to Numeric

  def employmentLength_to_int(s):
      if pd.isnull(s):
          return s
      else:
          return np.int8(s.split()[0])

  for data in [data_train, data_test_a]:
      data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
      data['employmentLength'].replace('< 1 year', '0 years', inplace=True)
      data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)

  data['employmentLength'].value_counts(dropna=False).sort_index()  # here `data` is data_test_a, the last loop value (counts sum to 200000)
  # 0.0 15989
  # 1.0 13182
  # 2.0 18207
  # 3.0 16011
  # 4.0 11833
  # 5.0 12543
  # 6.0 9328
  # 7.0 8823
  # 8.0 8976
  # 9.0 7594
  # 10.0 65772
  # NaN 11742
  # Name: employmentLength, dtype: int64
  • Preprocess earliesCreditLine:
  data_train['earliesCreditLine'].sample(5)
  # 642880 Jun-1992
  # 77423 Aug-1983
  # 356008 Mar-1999
  # 84346 Aug-2007
  # 574182 Sep-2005
  # Name: earliesCreditLine, dtype: object

  # Keep only the year
  for data in [data_train, data_test_a]:
      data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))

Handling Categorical Features

  # Some of the categorical features
  cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode',
                   'applicationType', 'initialListStatus', 'title', 'policyCode']
  for f in cate_features:
      print(f, 'number of categories:', data_train[f].nunique())
  # grade number of categories: 7
  # subGrade number of categories: 35
  # employmentTitle number of categories: 248683
  # homeOwnership number of categories: 6
  # verificationStatus number of categories: 3
  # purpose number of categories: 14
  # postCode number of categories: 932
  # regionCode number of categories: 51
  # applicationType number of categories: 2
  # initialListStatus number of categories: 2
  # title number of categories: 39644
  # policyCode number of categories: 1

  • Graded features such as grade have an inherent ordering, so they can be label-encoded or mapped by hand:

  for data in [data_train, data_test_a]:
      data['grade'] = data['grade'].map({'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7})

  # One-hot encode features with more than 2 categories that are purely nominal and not high-dimensional/sparse.
  # Caveat: rebinding `data` inside the loop does not change data_train / data_test_a themselves;
  # assign back explicitly (data_train = pd.get_dummies(data_train, ...)) if the dummies should persist.
  for data in [data_train, data_test_a]:
      data = pd.get_dummies(data, columns=['subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)

Outlier Handling

  • When you find outliers, first work out what caused them before deciding how to handle them. If an outlier reflects nothing systematic but is a purely accidental occurrence, or you simply do not want to study such accidents, it can be deleted. If, however, the outliers are real and represent a genuine phenomenon, they must not be dropped casually. In fraud scenarios, the fraudulent records are often themselves anomalous relative to normal data; we want to keep those anomalies, refit the model, and study their patterns. Use a supervised model where labels allow it; otherwise consider anomaly-detection algorithms.
  • Note: never delete rows from the test set.

Outlier Detection Method 1: The 3σ Rule

  • In statistics, if a distribution is approximately normal, about 68% of the values lie within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
  • After flagging a feature's outliers, you can go on to analyze how the outliers relate to the target variable:
  def find_outliers_by_3sigma(data, fea):
      data_std = np.std(data[fea])
      data_mean = np.mean(data[fea])
      outliers_cut_off = data_std * 3
      lower_rule = data_mean - outliers_cut_off
      upper_rule = data_mean + outliers_cut_off
      data[fea + '_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
      return data

  data_train = data_train.copy()
  for fea in numerical_fea:
      data_train = find_outliers_by_3sigma(data_train, fea)
      print(data_train[fea + '_outliers'].value_counts())
      print(data_train.groupby(fea + '_outliers')['isDefault'].sum())
      print('*' * 10)
  # normal 800000
  # Name: id_outliers, dtype: int64
  # id_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: loanAmnt_outliers, dtype: int64
  # loanAmnt_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: term_outliers, dtype: int64
  # term_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 794259
  # outlier 5741
  # Name: interestRate_outliers, dtype: int64
  # interestRate_outliers
  # outlier 2916
  # normal 156694
  # Name: isDefault, dtype: int64
  # **********
  # normal 792046
  # outlier 7954
  # Name: installment_outliers, dtype: int64
  # installment_outliers
  # outlier 2152
  # normal 157458
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: employmentTitle_outliers, dtype: int64
  # employmentTitle_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 799701
  # outlier 299
  # Name: homeOwnership_outliers, dtype: int64
  # homeOwnership_outliers
  # outlier 62
  # normal 159548
  # Name: isDefault, dtype: int64
  # **********
  # normal 793973
  # outlier 6027
  # Name: annualIncome_outliers, dtype: int64
  # annualIncome_outliers
  # outlier 756
  # normal 158854
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: verificationStatus_outliers, dtype: int64
  # verificationStatus_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 783003
  # outlier 16997
  # Name: purpose_outliers, dtype: int64
  # purpose_outliers
  # outlier 3635
  # normal 155975
  # Name: isDefault, dtype: int64
  # **********
  # normal 798931
  # outlier 1069
  # Name: postCode_outliers, dtype: int64
  # postCode_outliers
  # outlier 221
  # normal 159389
  # Name: isDefault, dtype: int64
  # **********
  # normal 799994
  # outlier 6
  # Name: regionCode_outliers, dtype: int64
  # regionCode_outliers
  # outlier 1
  # normal 159609
  # Name: isDefault, dtype: int64
  # **********
  # normal 798440
  # outlier 1560
  # Name: dti_outliers, dtype: int64
  # dti_outliers
  # outlier 466
  # normal 159144
  # Name: isDefault, dtype: int64
  # **********
  # normal 778245
  # outlier 21755
  # Name: delinquency_2years_outliers, dtype: int64
  # delinquency_2years_outliers
  # outlier 5089
  # normal 154521
  # Name: isDefault, dtype: int64
  # **********
  # normal 788261
  # outlier 11739
  # Name: ficoRangeLow_outliers, dtype: int64
  # ficoRangeLow_outliers
  # outlier 778
  # normal 158832
  # Name: isDefault, dtype: int64
  # **********
  # normal 788261
  # outlier 11739
  # Name: ficoRangeHigh_outliers, dtype: int64
  # ficoRangeHigh_outliers
  # outlier 778
  # normal 158832
  # Name: isDefault, dtype: int64
  # **********
  # normal 790889
  # outlier 9111
  # Name: openAcc_outliers, dtype: int64
  # openAcc_outliers
  # outlier 2195
  # normal 157415
  # Name: isDefault, dtype: int64
  # **********
  # normal 792471
  # outlier 7529
  # Name: pubRec_outliers, dtype: int64
  # pubRec_outliers
  # outlier 1701
  # normal 157909
  # Name: isDefault, dtype: int64
  # **********
  # normal 794120
  # outlier 5880
  # Name: pubRecBankruptcies_outliers, dtype: int64
  # pubRecBankruptcies_outliers
  # outlier 1423
  # normal 158187
  # Name: isDefault, dtype: int64
  # **********
  # normal 790001
  # outlier 9999
  # Name: revolBal_outliers, dtype: int64
  # revolBal_outliers
  # outlier 1359
  # normal 158251
  # Name: isDefault, dtype: int64
  # **********
  # normal 799948
  # outlier 52
  # Name: revolUtil_outliers, dtype: int64
  # revolUtil_outliers
  # outlier 23
  # normal 159587
  # Name: isDefault, dtype: int64
  # **********
  # normal 791663
  # outlier 8337
  # Name: totalAcc_outliers, dtype: int64
  # totalAcc_outliers
  # outlier 1668
  # normal 157942
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: initialListStatus_outliers, dtype: int64
  # initialListStatus_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 784586
  # outlier 15414
  # Name: applicationType_outliers, dtype: int64
  # applicationType_outliers
  # outlier 3875
  # normal 155735
  # Name: isDefault, dtype: int64
  # **********
  # normal 775134
  # outlier 24866
  # Name: title_outliers, dtype: int64
  # title_outliers
  # outlier 3900
  # normal 155710
  # Name: isDefault, dtype: int64
  # **********
  # normal 800000
  # Name: policyCode_outliers, dtype: int64
  # policyCode_outliers
  # normal 159610
  # Name: isDefault, dtype: int64
  # **********
  # normal 782773
  # outlier 17227
  # Name: n0_outliers, dtype: int64
  # n0_outliers
  # outlier 3485
  # normal 156125
  # Name: isDefault, dtype: int64
  # **********
  # normal 790500
  # outlier 9500
  # Name: n1_outliers, dtype: int64
  # n1_outliers
  # outlier 2491
  # normal 157119
  # Name: isDefault, dtype: int64
  # **********
  # normal 789067
  # outlier 10933
  # Name: n2_outliers, dtype: int64
  # n2_outliers
  # outlier 3205
  # normal 156405
  # Name: isDefault, dtype: int64
  # **********
  # normal 789067
  # outlier 10933
  # Name: n2.1_outliers, dtype: int64
  # n2.1_outliers
  # outlier 3205
  # normal 156405
  # Name: isDefault, dtype: int64
  # **********
  # normal 788660
  # outlier 11340
  # Name: n4_outliers, dtype: int64
  # n4_outliers
  # outlier 2476
  # normal 157134
  # Name: isDefault, dtype: int64
  # **********
  # normal 790355
  # outlier 9645
  # Name: n5_outliers, dtype: int64
  # n5_outliers
  # outlier 1858
  # normal 157752
  # Name: isDefault, dtype: int64
  # **********
  # normal 786006
  # outlier 13994
  # Name: n6_outliers, dtype: int64
  # n6_outliers
  # outlier 3182
  # normal 156428
  # Name: isDefault, dtype: int64
  # **********
  # normal 788430
  # outlier 11570
  # Name: n7_outliers, dtype: int64
  # n7_outliers
  # outlier 2746
  # normal 156864
  # Name: isDefault, dtype: int64
  # **********
  # normal 789625
  # outlier 10375
  # Name: n8_outliers, dtype: int64
  # n8_outliers
  # outlier 2131
  # normal 157479
  # Name: isDefault, dtype: int64
  # **********
  # normal 786384
  # outlier 13616
  # Name: n9_outliers, dtype: int64
  # n9_outliers
  # outlier 3953
  # normal 155657
  # Name: isDefault, dtype: int64
  # **********
  # normal 788979
  # outlier 11021
  # Name: n10_outliers, dtype: int64
  # n10_outliers
  # outlier 2639
  # normal 156971
  # Name: isDefault, dtype: int64
  # **********
  # normal 799434
  # outlier 566
  # Name: n11_outliers, dtype: int64
  # n11_outliers
  # outlier 112
  # normal 159498
  # Name: isDefault, dtype: int64
  # **********
  # normal 797585
  # outlier 2415
  # Name: n12_outliers, dtype: int64
  # n12_outliers
  # outlier 545
  # normal 159065
  # Name: isDefault, dtype: int64
  # **********
  # normal 788907
  # outlier 11093
  # Name: n13_outliers, dtype: int64
  # n13_outliers
  # outlier 2482
  # normal 157128
  # Name: isDefault, dtype: int64
  # **********
  # normal 788884
  # outlier 11116
  # Name: n14_outliers, dtype: int64
  # n14_outliers
  # outlier 3364
  # normal 156246
  # Name: isDefault, dtype: int64
  # **********

  # Delete the outliers (training set only; never drop rows from the test set)
  for fea in numerical_fea:
      data_train = data_train[data_train[fea + '_outliers'] == 'normal']
      data_train = data_train.reset_index(drop=True)

Outlier Detection Method 2: Box Plots

  • In one sentence: the quartiles split the data at three cut points into four intervals, with IQR = Q3 − Q1, lower whisker = Q1 − 1.5 × IQR, and upper whisker = Q3 + 1.5 × IQR; points beyond the whiskers are treated as outliers.
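
The article gives no code for this method; below is a minimal sketch that mirrors the 3-sigma helper above (the function name find_outliers_by_iqr and the example column 'loanAmnt' are my own illustrative choices, not part of the original):

  def find_outliers_by_iqr(data, fea):
      # Quartiles and interquartile range
      q1 = data[fea].quantile(0.25)
      q3 = data[fea].quantile(0.75)
      iqr = q3 - q1
      lower_rule = q1 - 1.5 * iqr  # lower whisker
      upper_rule = q3 + 1.5 * iqr  # upper whisker
      data[fea + '_outliers'] = data[fea].apply(lambda x: 'outlier' if x > upper_rule or x < lower_rule else 'normal')
      return data

  # Usage, analogous to the 3-sigma version:
  # data_train = find_outliers_by_iqr(data_train, 'loanAmnt')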

Data Binning

  • Purpose of feature binning:
    • From a modeling perspective, binning reduces the complexity of a variable and the influence of its noise on the model, and raises the correlation between the independent variable and the target, making the model more stable.
  • What gets binned:
    • Continuous variables are discretized.
    • Discrete variables with many states are merged into fewer states.
  • Why bin:
    • Feature values can span a very wide range. For both supervised models and unsupervised methods such as k-means clustering, which uses Euclidean distance as its similarity measure, large-valued features drown out small-valued ones. One remedy is to quantize the values into intervals, i.e. data binning (also called bucketing), and work with the quantized result instead.
  • Advantages of binning:
    • Handling missing values: when the source data contains missing values, null can be treated as a bin of its own.
    • Handling outliers: discretizing into bins makes a variable robust to extreme points. For example, an anomalous age of 200 falls into the "age > 60" bin, removing its influence.
    • Business interpretability: we tend to assume variables act linearly (as x grows, y grows), but x and y often have a nonlinear relationship, which a WOE transformation on the bins can capture (see the sketch after this list).
  • Pay particular attention to the basic binning principles:
    • No bin should hold less than 5% of the samples.
    • No bin should contain only good (non-default) samples.
    • The event rate should be monotonic across consecutive bins.
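
The WOE transformation mentioned above is not implemented anywhere in this article; here is a minimal sketch that computes each bin's bad rate and WOE. The helper name woe_table is my own, and 'loanAmnt_bin3' refers to the quantile-bin column built in the next section, purely for illustration:

  def woe_table(df, bin_col, target='isDefault'):
      grp = df.groupby(bin_col)[target].agg(['count', 'sum'])
      grp['bad'] = grp['sum']                  # defaults in the bin
      grp['good'] = grp['count'] - grp['sum']  # non-defaults in the bin
      grp['bad_rate'] = grp['bad'] / grp['count']
      # WOE = ln(bin's share of all bads / bin's share of all goods);
      # some references flip the ratio, so the sign convention varies
      grp['woe'] = np.log((grp['bad'] / grp['bad'].sum()) / (grp['good'] / grp['good'].sum()))
      return grp[['count', 'bad_rate', 'woe']]

  # e.g. woe_table(data_train, 'loanAmnt_bin3')

A bin with zero goods or zero bads makes the WOE infinite, which is exactly why the principles above forbid single-class bins.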

Fixed-Width Binning

  • When values span several orders of magnitude, it is usually best to group by powers of 10 (or of any constant). Fixed-width bins are trivial to compute, but large gaps in the counts produce many empty bins.

  # Map into evenly spaced bins by integer division; each bin covers a range of 1000 in loanAmnt
  data['loanAmnt_bin1'] = np.floor_divide(data['loanAmnt'], 1000)
  data['loanAmnt_bin1']
  # 0 14.0
  # 1 20.0
  # 2 12.0
  # 3 17.0
  # 4 35.0
  # ...
  # 199995 7.0
  # 199996 6.0
  # 199997 14.0
  # 199998 8.0
  # 199999 8.0
  # Name: loanAmnt_bin1, Length: 200000, dtype: float64

  # Map into exponential-width bins via a log transform
  data['loanAmnt_bin2'] = np.floor(np.log10(data['loanAmnt']))
  data['loanAmnt_bin2']
  # 0 4.0
  # 1 4.0
  # 2 4.0
  # 3 4.0
  # 4 4.0
  # ...
  # 199995 3.0
  # 199996 3.0
  # 199997 4.0
  # 199998 3.0
  # 199999 3.0
  # Name: loanAmnt_bin2, Length: 200000, dtype: float64

Quantile Binning

  data['loanAmnt_bin3'] = pd.qcut(data['loanAmnt'], 10, labels=False)
  data['loanAmnt_bin3']
  # 0 5
  # 1 7
  # 2 4
  # 3 6
  # 4 9
  # ..
  # 199995 2
  # 199996 1
  # 199997 5
  # 199998 2
  # 199999 2
  # Name: loanAmnt_bin3, Length: 200000, dtype: int64
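
The content overview also lists chi-square (ChiMerge) binning, for which the article gives no code. A minimal sketch follows: start from fine-grained quantile bins and repeatedly merge the adjacent pair of bins with the smallest chi-square statistic until only the target number of bins remains. The function names and the init_bins/max_bins defaults are my own choices:

  def chi2_of_pair(counts):
      # Chi-square statistic of a 2x2 table: two adjacent bins x (good, bad)
      counts = np.asarray(counts, dtype=float)
      total = counts.sum()
      expected = counts.sum(axis=1, keepdims=True) * counts.sum(axis=0, keepdims=True) / total
      mask = expected > 0  # skip empty cells to avoid division by zero
      return (((counts - expected) ** 2)[mask] / expected[mask]).sum()

  def chimerge_bins(x, y, max_bins=5, init_bins=50):
      df = pd.DataFrame({'x': x, 'y': y}).dropna()
      buckets = pd.qcut(df['x'], init_bins, duplicates='drop')
      grouped = df.groupby(buckets)['y'].agg(['count', 'sum'])
      table = np.c_[grouped['count'] - grouped['sum'], grouped['sum']]  # [goods, bads] per bin
      edges = [iv.left for iv in grouped.index] + [grouped.index[-1].right]
      while len(table) > max_bins:
          chis = [chi2_of_pair(table[i:i + 2]) for i in range(len(table) - 1)]
          i = int(np.argmin(chis))   # the most similar adjacent pair
          table[i] += table[i + 1]   # merge their counts
          table = np.delete(table, i + 1, axis=0)
          edges.pop(i + 1)           # drop the boundary between the merged bins
      return edges

  # e.g. data['loanAmnt_bin4'] = pd.cut(data['loanAmnt'],
  #          bins=chimerge_bins(data_train['loanAmnt'], data_train['isDefault']), labels=False)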

Feature Interactions

  • Interaction features are trivial to construct but costly to use. If a linear model includes pairwise interaction features, both its training time and its scoring time grow from O(n) to O(n²), where n is the number of individual features.

  # Target-mean encoding of grade and subGrade
  for col in ['grade', 'subGrade']:
      temp_dict = data_train.groupby([col])['isDefault'].agg(['mean']).reset_index().rename(columns={'mean': col + '_target_mean'})
      temp_dict.index = temp_dict[col].values
      temp_dict = temp_dict[col + '_target_mean'].to_dict()
      data_train[col + '_target_mean'] = data_train[col].map(temp_dict)
      data_test_a[col + '_target_mean'] = data_test_a[col].map(temp_dict)
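
One caveat with the target-mean encoding above: each training row's statistic is computed from the full training set, including that row's own label, which leaks the target into the feature. A common remedy is out-of-fold encoding, where each row's statistic comes only from the other folds. The sketch below is my own addition, not part of the original notebook (the '_target_mean_oof' column name is illustrative):

  kf_enc = KFold(n_splits=5, shuffle=True, random_state=2020)
  for col in ['grade', 'subGrade']:
      data_train[col + '_target_mean_oof'] = np.nan
      for trn_idx, val_idx in kf_enc.split(data_train):
          # Means computed on the other four folds only
          fold_means = data_train.iloc[trn_idx].groupby(col)['isDefault'].mean()
          data_train.loc[data_train.index[val_idx], col + '_target_mean_oof'] = \
              data_train[col].iloc[val_idx].map(fold_means)
      # The test set can safely use statistics from the whole training set
      data_test_a[col + '_target_mean_oof'] = data_test_a[col].map(data_train.groupby(col)['isDefault'].mean())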

  # Other derived variables: ratios to the group mean and std
  for df in [data_train, data_test_a]:
      for item in ['n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']:
          df['grade_to_mean_' + item] = df['grade'] / df.groupby([item])['grade'].transform('mean')
          df['grade_to_std_' + item] = df['grade'] / df.groupby([item])['grade'].transform('std')

Feature Encoding

Label Encoding for Direct Use in Tree Models

  # Label-encode the high-cardinality categorical features:
  # employmentTitle, postCode, title, subGrade
  for col in tqdm(['employmentTitle', 'postCode', 'title', 'subGrade']):
      le = LabelEncoder()
      le.fit(list(data_train[col].astype(str).values) + list(data_test_a[col].astype(str).values))
      data_train[col] = le.transform(list(data_train[col].astype(str).values))
      data_test_a[col] = le.transform(list(data_test_a[col].astype(str).values))
  print('Label Encoding done')
  # 100%|██████████| 4/4 [00:07<00:00, 1.76s/it]
  # Label Encoding done

Extra Feature Engineering for Logistic Regression and Similar Models

  • Normalize the features and remove highly correlated ones.
  • Normalization helps training converge better and faster and stops large-valued features from drowning out small-valued ones.
  • Removing correlated features improves the model's interpretability and speeds up prediction.

  # Example min-max normalization (pseudocode)
  for fea in [list_of_features_to_normalize]:
      data[fea] = (data[fea] - np.min(data[fea])) / (np.max(data[fea]) - np.min(data[fea]))
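
Equivalently, the MinMaxScaler imported at the top of this article can do the normalization; the sketch below also shows one common way to drop highly correlated features. The column choice and the 0.95 threshold are illustrative assumptions — fit the scaler on the training set only, then apply the same transform to the test set:

  scale_cols = numerical_fea  # illustrative choice of columns
  scaler = MinMaxScaler()
  data_train[scale_cols] = scaler.fit_transform(data_train[scale_cols])
  data_test_a[scale_cols] = scaler.transform(data_test_a[scale_cols])

  # Drop one feature from each highly correlated pair
  corr = data_train[scale_cols].corr().abs()
  upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # upper triangle only
  to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]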

Feature Selection

  • Feature selection prunes useless features to reduce the complexity of the final model; the goal is a parsimonious model that computes faster with no, or negligible, loss of predictive accuracy. Feature selection is not about cutting training time (some techniques actually increase it) but about cutting model scoring time.

Filter

Variance-threshold method

  • Compute each feature's variance, then keep only the features whose variance exceeds a chosen threshold:

  from sklearn.feature_selection import VarianceThreshold
  # threshold is the variance cutoff
  VarianceThreshold(threshold=3).fit_transform(train, target_train)

Correlation-coefficient method

  • The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable; it measures linear correlation. The result lies in [-1, 1]: -1 means perfect negative correlation, +1 perfect positive correlation, and 0 no linear correlation.

  from sklearn.feature_selection import SelectKBest
  from scipy.stats import pearsonr
  # Select the K best features and return the reduced data.
  # The score function takes the feature matrix and target vector and returns
  # (scores, p-values), one pair per feature; here it is built from pearsonr.
  def pearson_score(X, y):
      scores, pvalues = zip(*[pearsonr(X[:, i], y) for i in range(X.shape[1])])
      return np.array(scores), np.array(pvalues)
  # k is the number of features to keep (note: negative correlations score low;
  # take abs() of the scores if magnitude is what matters)
  SelectKBest(pearson_score, k=5).fit_transform(train, target_train)

Chi-square test

  • The classic chi-square test checks the dependence between a categorical feature and the target. Suppose the feature takes N values and the target takes M values; the test compares the observed frequency of samples with feature value i and target value j against its expected frequency. The statistic is χ² = Σ (A − T)² / T, where A is the observed count and T is the theoretical (expected) count.
  • Note: chi2 can only be applied to non-negative feature values; otherwise sklearn raises "Input X must be non-negative".

  from sklearn.feature_selection import SelectKBest
  from sklearn.feature_selection import chi2
  # k is the number of features to keep
  SelectKBest(chi2, k=5).fit_transform(train, target_train)

Mutual-information method

  • Classic mutual information also measures the dependence between a feature and the target. The SelectKBest class from feature_selection can be combined with the maximal information coefficient (MIC) to select features:

  from sklearn.feature_selection import SelectKBest
  from minepy import MINE
  # MINE's API is not functional in style; wrap it in a score function that
  # returns (scores, p-values). MINE provides no p-value, so a fixed
  # placeholder of 0.5 is used.
  def mic_scores(X, Y):
      scores, pvalues = [], []
      for i in range(X.shape[1]):
          m = MINE()
          m.compute_score(X[:, i], Y)
          scores.append(m.mic())
          pvalues.append(0.5)
      return np.array(scores), np.array(pvalues)
  # k is the number of features to keep
  SelectKBest(mic_scores, k=2).fit_transform(train, target_train)
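
If minepy is not installed, scikit-learn's built-in mutual-information estimator is a common substitute (my own addition, not part of the original):

  from sklearn.feature_selection import mutual_info_classif
  # mutual_info_classif returns one MI score per feature
  SelectKBest(mutual_info_classif, k=5).fit_transform(train, target_train)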

Wrapper (Recursive Feature Elimination, RFE)

  • Recursive feature elimination trains a base model over multiple rounds; after each round, the features with the smallest weights are removed and the next round trains on the remaining feature set. The RFE class from feature_selection selects features this way (here with logistic regression as the base model):

  from sklearn.feature_selection import RFE
  from sklearn.linear_model import LogisticRegression
  # Recursive feature elimination; returns the data after feature selection
  # estimator is the base model
  # n_features_to_select is the number of features to keep
  RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(train, target_train)

Embedded

  • Penalty-based feature selection uses a base model with a regularization penalty; besides selecting features, it also performs dimensionality reduction. The SelectFromModel class from feature_selection combined with logistic regression:

  from sklearn.feature_selection import SelectFromModel
  from sklearn.linear_model import LogisticRegression
  # Feature selection with L1-penalized logistic regression as the base model
  # (recent scikit-learn versions need a solver that supports L1, e.g. liblinear)
  SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(train, target_train)

  • Tree-based feature selection: a GBDT can also serve as the base model. SelectFromModel combined with a GBDT model:

  from sklearn.feature_selection import SelectFromModel
  from sklearn.ensemble import GradientBoostingClassifier
  # Feature selection with GBDT as the base model
  SelectFromModel(GradientBoostingClassifier()).fit_transform(train, target_train)
  • For this dataset, after dropping the features that will not enter the model and filling missing values, we check the correlations between features and with the target, and then move on to model training:

  # Drop columns we no longer need
  for data in [data_train, data_test_a]:
      data.drop(['issueDate', 'id'], axis=1, inplace=True)

  # Forward-fill: replace each missing value with the value above it
  data_train = data_train.fillna(axis=0, method='ffill')

  x_train = data_train.drop(['isDefault'], axis=1)
  # Compute the correlation of each feature with the target
  data_corr = x_train.corrwith(data_train.isDefault)
  result = pd.DataFrame(columns=['features', 'corr'])
  result['features'] = data_corr.index
  result['corr'] = data_corr.values

  # The correlations can also be inspected visually
  data_numeric = data_train[numerical_fea]
  correlation = data_numeric.corr()
  f, ax = plt.subplots(figsize=(7, 7))
  plt.title('Correlation of Numeric Features', y=1, size=16)
  sns.heatmap(correlation, square=True, vmax=0.8)

  features = [f for f in data_train.columns if f not in ['id', 'issueDate', 'isDefault'] and '_outliers' not in f]
  x_train = data_train[features]
  x_test = data_test_a[features]
  y_train = data_train['isDefault']

  def cv_model(clf, train_x, train_y, test_x, clf_name):
      folds = 5
      seed = 2020
      kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
      train = np.zeros(train_x.shape[0])
      test = np.zeros(test_x.shape[0])
      cv_scores = []
      for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
          print('************************************ {} ************************************'.format(str(i + 1)))
          trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
          if clf_name == "lgb":
              train_matrix = clf.Dataset(trn_x, label=trn_y)
              valid_matrix = clf.Dataset(val_x, label=val_y)
              params = {
                  'boosting_type': 'gbdt',
                  'objective': 'binary',
                  'metric': 'auc',
                  'min_child_weight': 5,
                  'num_leaves': 2 ** 5,
                  'lambda_l2': 10,
                  'feature_fraction': 0.8,
                  'bagging_fraction': 0.8,
                  'bagging_freq': 4,
                  'learning_rate': 0.1,
                  'seed': 2020,
                  'nthread': 28,
                  'n_jobs': 24,
                  'silent': True,
                  'verbose': -1,
              }
              model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)
              val_pred = model.predict(val_x, num_iteration=model.best_iteration)
              test_pred = model.predict(test_x, num_iteration=model.best_iteration)
              # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])
          if clf_name == "xgb":
              train_matrix = clf.DMatrix(trn_x, label=trn_y)
              valid_matrix = clf.DMatrix(val_x, label=val_y)
              test_matrix = clf.DMatrix(test_x)  # Booster.predict requires a DMatrix
              params = {'booster': 'gbtree',
                        'objective': 'binary:logistic',
                        'eval_metric': 'auc',
                        'gamma': 1,
                        'min_child_weight': 1.5,
                        'max_depth': 5,
                        'lambda': 10,
                        'subsample': 0.7,
                        'colsample_bytree': 0.7,
                        'colsample_bylevel': 0.7,
                        'eta': 0.04,
                        'tree_method': 'exact',
                        'seed': 2020,
                        'nthread': 36,
                        "silent": True,
                        }
              watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
              model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
              val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
              test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
          if clf_name == "cat":
              params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                        'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}
              model = clf(iterations=20000, **params)
              model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                        cat_features=[], use_best_model=True, verbose=500)
              val_pred = model.predict(val_x)
              test_pred = model.predict(test_x)
          train[valid_index] = val_pred
          test += test_pred / kf.n_splits  # accumulate the mean of the fold predictions
          cv_scores.append(roc_auc_score(val_y, val_pred))
          print(cv_scores)
      print("%s_scotrainre_list:" % clf_name, cv_scores)
      print("%s_score_mean:" % clf_name, np.mean(cv_scores))
      print("%s_score_std:" % clf_name, np.std(cv_scores))
      return train, test

  def lgb_model(x_train, y_train, x_test):
      lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
      return lgb_train, lgb_test

  def xgb_model(x_train, y_train, x_test):
      xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
      return xgb_train, xgb_test

  def cat_model(x_train, y_train, x_test):
      cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
      return cat_train, cat_test

  lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
  # ************************************ 1 ************************************
  # [LightGBM] [Warning] num_threads is set with n_jobs=24, nthread=28 will be ignored. Current value: num_threads=24
  # [LightGBM] [Warning] Unknown parameter: silent
  # Training until validation scores don't improve for 200 rounds
  # [200] training's auc: 0.749114 valid_1's auc: 0.729275
  # [400] training's auc: 0.764716 valid_1's auc: 0.730125
  # [600] training's auc: 0.778489 valid_1's auc: 0.729928
  # Early stopping, best iteration is:
  # [446] training's auc: 0.768137 valid_1's auc: 0.730186
  # [0.7301862239949224]
  # ************************************ 2 ************************************
  # [LightGBM] [Warning] num_threads is set with n_jobs=24, nthread=28 will be ignored. Current value: num_threads=24
  # [LightGBM] [Warning] Unknown parameter: silent
  # Training until validation scores don't improve for 200 rounds
  # [200] training's auc: 0.748999 valid_1's auc: 0.731035
  # [400] training's auc: 0.764879 valid_1's auc: 0.731436
  # [600] training's auc: 0.778506 valid_1's auc: 0.730823
  # Early stopping, best iteration is:
  # [414] training's auc: 0.765823 valid_1's auc: 0.731478
  # [0.7301862239949224, 0.7314779648434573]
  # ************************************ 3 ************************************
  # [LightGBM] [Warning] num_threads is set with n_jobs=24, nthread=28 will be ignored. Current value: num_threads=24
  # [LightGBM] [Warning] Unknown parameter: silent
  # Training until validation scores don't improve for 200 rounds
  # [200] training's auc: 0.748145 valid_1's auc: 0.73253
  # [400] training's auc: 0.763814 valid_1's auc: 0.733272
  # [600] training's auc: 0.777895 valid_1's auc: 0.733354
  # Early stopping, best iteration is:
  # [475] training's auc: 0.769215 valid_1's auc: 0.73355
  # [0.7301862239949224, 0.7314779648434573, 0.7335502065719879]
  # ************************************ 4 ************************************
  # [LightGBM] [Warning] num_threads is set with n_jobs=24, nthread=28 will be ignored. Current value: num_threads=24
  # [LightGBM] [Warning] Unknown parameter: silent
  # Training until validation scores don't improve for 200 rounds
  # [200] training's auc: 0.749417 valid_1's auc: 0.727507
  # [400] training's auc: 0.765066 valid_1's auc: 0.728261
  # Early stopping, best iteration is:
  # [353] training's auc: 0.761647 valid_1's auc: 0.728349
  # [0.7301862239949224, 0.7314779648434573, 0.7335502065719879, 0.7283491938614568]
  # ************************************ 5 ************************************
  # [LightGBM] [Warning] num_threads is set with n_jobs=24, nthread=28 will be ignored. Current value: num_threads=24
  # [LightGBM] [Warning] Unknown parameter: silent
  # Training until validation scores don't improve for 200 rounds
  # [200] training's auc: 0.748562 valid_1's auc: 0.73262
  # [400] training's auc: 0.764493 valid_1's auc: 0.733365
  # Early stopping, best iteration is:
  # [394] training's auc: 0.764109 valid_1's auc: 0.733381
  # [0.7301862239949224, 0.7314779648434573, 0.7335502065719879, 0.7283491938614568, 0.7333810157041901]
  # lgb_scotrainre_list: [0.7301862239949224, 0.7314779648434573, 0.7335502065719879, 0.7283491938614568, 0.7333810157041901]
  # lgb_score_mean: 0.7313889209952029
  # lgb_score_std: 0.001966415347937543

  xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
  # ************************************ 1 ************************************
  # [15:02:32] WARNING: ../src/learner.cc:516:
  # Parameters: { silent } might not be used.
  # This may not be accurate due to some parameters are only used in language bindings but
  # passed down to XGBoost core. Or some parameters are not used but slip through this
  # verification. Please open an issue if you find above cases.
  # [0] train-auc:0.69713 eval-auc:0.69580
  # Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
  # Will train until eval-auc hasn't improved in 200 rounds.
  # [200] train-auc:0.73103 eval-auc:0.72371
  # [400] train-auc:0.74040 eval-auc:0.72807
  # [600] train-auc:0.74624 eval-auc:0.72966
  # [800] train-auc:0.75132 eval-auc:0.73055
  # [1000] train-auc:0.75580 eval-auc:0.73101
  # [1200] train-auc:0.76004 eval-auc:0.73127
  # [1400] train-auc:0.76409 eval-auc:0.73156
  # [1600] train-auc:0.76791 eval-auc:0.73169
  # [1800] train-auc:0.77156 eval-auc:0.73173
  # [2000] train-auc:0.77506 eval-auc:0.73167
  # Stopping. Best iteration:
  # [1852] train-auc:0.77251 eval-auc:0.73177
  # [0.731769339538683]
  # ************************************ 2 ************************************
  # [15:07:16] WARNING: ../src/learner.cc:516:
  # Parameters: { silent } might not be used.
  # This may not be accurate due to some parameters are only used in language bindings but
  # passed down to XGBoost core. Or some parameters are not used but slip through this
  # verification. Please open an issue if you find above cases.
  # [0] train-auc:0.69687 eval-auc:0.69574
  # Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
  # Will train until eval-auc hasn't improved in 200 rounds.
  # [200] train-auc:0.73078 eval-auc:0.72640
  # [400] train-auc:0.74020 eval-auc:0.73023
  # [600] train-auc:0.74605 eval-auc:0.73156
  # [800] train-auc:0.75114 eval-auc:0.73231
  # [1000] train-auc:0.75562 eval-auc:0.73275
  # [1200] train-auc:0.75987 eval-auc:0.73310
  # [1400] train-auc:0.76372 eval-auc:0.73317
  # [1600] train-auc:0.76757 eval-auc:0.73330
  # [1800] train-auc:0.77123 eval-auc:0.73335
  # [2000] train-auc:0.77484 eval-auc:0.73339
  # Stopping. Best iteration:
  # [1829] train-auc:0.77173 eval-auc:0.73340
  # [0.731769339538683, 0.733395913606802]
  # ************************************ 3 ************************************
  # [15:11:52] WARNING: ../src/learner.cc:516:
  # Parameters: { silent } might not be used.
  # This may not be accurate due to some parameters are only used in language bindings but
  # passed down to XGBoost core. Or some parameters are not used but slip through this
  # verification. Please open an issue if you find above cases.
  # [0] train-auc:0.69730 eval-auc:0.69647
  # Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
  # Will train until eval-auc hasn't improved in 200 rounds.
  # [200] train-auc:0.73072 eval-auc:0.72604
  # [400] train-auc:0.73965 eval-auc:0.73076
  # [600] train-auc:0.74548 eval-auc:0.73241
  # [800] train-auc:0.75050 eval-auc:0.73356
  # [1000] train-auc:0.75501 eval-auc:0.73416
  # [1200] train-auc:0.75898 eval-auc:0.73460
  # [1400] train-auc:0.76303 eval-auc:0.73487
  # [1600] train-auc:0.76689 eval-auc:0.73507
  # [1800] train-auc:0.77059 eval-auc:0.73507
  # Stopping. Best iteration:
  # [1703] train-auc:0.76871 eval-auc:0.73515
  # [0.731769339538683, 0.733395913606802, 0.7351456720593506]
  # ************************************ 4 ************************************
  # [15:16:15] WARNING: ../src/learner.cc:516:
  # Parameters: { silent } might not be used.
  # This may not be accurate due to some parameters are only used in language bindings but
  # passed down to XGBoost core. Or some parameters are not used but slip through this
  # verification. Please open an issue if you find above cases.
  # [0] train-auc:0.69737 eval-auc:0.69375
  # Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
  # Will train until eval-auc hasn't improved in 200 rounds.
  # [200] train-auc:0.73148 eval-auc:0.72250
  # [400] train-auc:0.74044 eval-auc:0.72639
  # [600] train-auc:0.74649 eval-auc:0.72804
  # [800] train-auc:0.75154 eval-auc:0.72887
  # [1000] train-auc:0.75598 eval-auc:0.72934
  # [1200] train-auc:0.75997 eval-auc:0.72954
  # [1400] train-auc:0.76401 eval-auc:0.72977
  # [1600] train-auc:0.76793 eval-auc:0.72989
  # [1800] train-auc:0.77159 eval-auc:0.72993
  # [2000] train-auc:0.77511 eval-auc:0.73002
  # [2200] train-auc:0.77850 eval-auc:0.72996
  # Stopping. Best iteration:
  # [2011] train-auc:0.77531 eval-auc:0.73004
  # [0.731769339538683, 0.733395913606802, 0.7351456720593506, 0.7300361842852358]
  # ************************************ 5 ************************************
  # [15:21:18] WARNING: ../src/learner.cc:516:
  # Parameters: { silent } might not be used.
  # This may not be accurate due to some parameters are only used in language bindings but
  # passed down to XGBoost core. Or some parameters are not used but slip through this
  # verification. Please open an issue if you find above cases.
  # [0] train-auc:0.69647 eval-auc:0.69701
  # Multiple eval metrics have been passed: 'eval-auc' will be used for early stopping.
  # Will train until eval-auc hasn't improved in 200 rounds.
  # [200] train-auc:0.73059 eval-auc:0.72675
  # [400] train-auc:0.73972 eval-auc:0.73089
  # [600] train-auc:0.74589 eval-auc:0.73256
  # [800] train-auc:0.75073 eval-auc:0.73347
  # [1000] train-auc:0.75523 eval-auc:0.73401
  # [1200] train-auc:0.75941 eval-auc:0.73419
  # [1400] train-auc:0.76342 eval-auc:0.73438
  # [1600] train-auc:0.76730 eval-auc:0.73458
  # [1800] train-auc:0.77105 eval-auc:0.73454
  # Stopping. Best iteration:
  # [1694] train-auc:0.76910 eval-auc:0.73464
  # [0.731769339538683, 0.733395913606802, 0.7351456720593506, 0.7300361842852358, 0.734639280693211]
  # xgb_scotrainre_list: [0.731769339538683, 0.733395913606802, 0.7351456720593506, 0.7300361842852358, 0.734639280693211]
  # xgb_score_mean: 0.7329972780366564
  # xgb_score_std: 0.0018839633265100187

  cat_train, cat_test = cat_model(x_train, y_train, x_test)
  # ************************************ 1 ************************************
  # 0: learn: 0.3944330 test: 0.3964727 best: 0.3964727 (0) total: 138ms remaining: 45m 59s
  # 500: learn: 0.3728126 test: 0.3756408 best: 0.3756408 (500) total: 28.1s remaining: 18m 13s
  # 1000: learn: 0.3711980 test: 0.3750523 best: 0.3750523 (1000) total: 56.2s remaining: 17m 47s
  # 1500: learn: 0.3699538 test: 0.3748118 best: 0.3748107 (1476) total: 1m 23s remaining: 17m 11s
  # 2000: learn: 0.3688546 test: 0.3746815 best: 0.3746815 (2000) total: 1m 51s remaining: 16m 44s
  # Stopped by overfitting detector (50 iterations wait)
  # bestTest = 0.3746253358
  # bestIteration = 2266
  # Shrink model to first 2267 iterations.
  # [0.7306375926022922]
  # ************************************ 2 ************************************
  # 0: learn: 0.3947513 test: 0.3951211 best: 0.3951211 (0) total: 71.1ms remaining: 23m 41s
  # 500: learn: 0.3731076 test: 0.3743412 best: 0.3743412 (500) total: 28.6s remaining: 18m 32s
  # 1000: learn: 0.3714544 test: 0.3737577 best: 0.3737570 (999) total: 56.7s remaining: 17m 56s
  # 1500: learn: 0.3702186 test: 0.3735397 best: 0.3735396 (1498) total: 1m 24s remaining: 17m 23s
  # 2000: learn: 0.3691118 test: 0.3734092 best: 0.3734074 (1977) total: 1m 52s remaining: 16m 54s
  # 2500: learn: 0.3680796 test: 0.3733234 best: 0.3733218 (2484) total: 2m 21s remaining: 16m 28s
  # Stopped by overfitting detector (50 iterations wait)
  # bestTest = 0.373251629
  # bestIteration = 2919
  # Shrink model to first 2920 iterations.
  # [0.7306375926022922, 0.7325015175914498]
  # ************************************ 3 ************************************
  # 0: learn: 0.3951060 test: 0.3937487 best: 0.3937487 (0) total: 70.2ms remaining: 23m 24s
  # 500: learn: 0.3734715 test: 0.3730983 best: 0.3730983 (500) total: 28.4s remaining: 18m 26s
  # 1000: learn: 0.3718399 test: 0.3724184 best: 0.3724184 (1000) total: 56.5s remaining: 17m 53s
  # 1500: learn: 0.3706048 test: 0.3721639 best: 0.3721639 (1500) total: 1m 24s remaining: 17m 24s
  # 2000: learn: 0.3695127 test: 0.3720199 best: 0.3720199 (2000) total: 1m 52s remaining: 16m 52s
  # 2500: learn: 0.3685041 test: 0.3719052 best: 0.3719025 (2479) total: 2m 20s remaining: 16m 20s
  # Stopped by overfitting detector (50 iterations wait)
  # bestTest = 0.3719024831
  # bestIteration = 2479
  # Shrink model to first 2480 iterations.
  # [0.7306375926022922, 0.7325015175914498, 0.7340103693991001]
  # ************************************ 4 ************************************
  # 0: learn: 0.3949491 test: 0.3943298 best: 0.3943298 (0) total: 66.8ms remaining: 22m 16s
  # 500: learn: 0.3732214 test: 0.3741316 best: 0.3741316 (500) total: 28.2s remaining: 18m 18s
  # 1000: learn: 0.3715666 test: 0.3735451 best: 0.3735414 (995) total: 56s remaining: 17m 42s
  # 1500: learn: 0.3703238 test: 0.3733058 best: 0.3733045 (1498) total: 1m 23s remaining: 17m 9s
  # 2000: learn: 0.3692105 test: 0.3731636 best: 0.3731634 (1999) total: 1m 51s remaining: 16m 41s
  # 2500: learn: 0.3681907 test: 0.3730490 best: 0.3730490 (2500) total: 2m 19s remaining: 16m 13s
  # Stopped by overfitting detector (50 iterations wait)
  # bestTest = 0.3730185197
  # bestIteration = 2723
  # Shrink model to first 2724 iterations.
  # [0.7306375926022922, 0.7325015175914498, 0.7340103693991001, 0.7291287412227256]
  # ************************************ 5 ************************************
  # 0: learn: 0.3948860 test: 0.3944692 best: 0.3944692 (0) total: 68.4ms remaining: 22m 47s
  # 500: learn: 0.3733508 test: 0.3734623 best: 0.3734623 (500) total: 28.7s remaining: 18m 37s
  # 1000: learn: 0.3717222 test: 0.3729094 best: 0.3729094 (1000) total: 57s remaining: 18m 2s
  # 1500: learn: 0.3704933 test: 0.3726407 best: 0.3726407 (1500) total: 1m 25s remaining: 17m 32s
  # 2000: learn: 0.3693930 test: 0.3725202 best: 0.3725200 (1998) total: 1m 53s remaining: 17m 4s
  # 2500: learn: 0.3683883 test: 0.3724494 best: 0.3724494 (2500) total: 2m 22s remaining: 16m 36s
  # Stopped by overfitting detector (50 iterations wait)
  # bestTest = 0.3724045318
  # bestIteration = 2904
  # Shrink model to first 2905 iterations.
  # [0.7306375926022922, 0.7325015175914498, 0.7340103693991001, 0.7291287412227256, 0.7342835786894728]
  # cat_scotrainre_list: [0.7306375926022922, 0.7325015175914498, 0.7340103693991001, 0.7291287412227256, 0.7342835786894728]
  # cat_score_mean: 0.7321123599010082
  # cat_score_std: 0.0019771188023493848

Summary

Feature engineering is one of the most important parts of machine learning, and even of deep learning; in practice it is often also the most time-consuming step. Algorithm textbooks usually devote pitifully little space to it, because feature engineering is so tightly coupled to the specific data that it is hard to cover every scenario systematically. This chapter introduced the common methods: techniques such as missing-value and outlier handling apply to virtually any dataset, while for operations like binning it only offers a few concrete approaches that readers should explore further on their own. Feature engineering also differs between competitions and real applications: in real-world credit-scorecard building for financial risk control, interpretability is emphasized, which makes feature binning especially important.
