赞
踩
保险是重要的金融体系,对社会发展,民生保障起到重要作用。保险欺诈近些年层出不穷,在某些险种上保险欺诈的金额已经占到了理赔金额的20%甚至更多。对保险欺诈的识别成为保险行业中的关键应用场景。
阿里云天池里面可以找到数据集
- import pandas as pd
-
- # 数据加载
- train = pd.read_csv('train.csv')
- train
policy_id | age | customer_months | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 122576 | 37 | 189 | 2013-08-21 | C | 500/1000 | 1000 | 1465.71 | 5000000 | 455456 | ... | 3 | ? | 54930 | 6029 | 5752 | 44452 | Nissan | Maxima | 2000 | 0 |
1 | 937713 | 44 | 234 | 1998-01-04 | B | 250/500 | 500 | 821.24 | 0 | 591805 | ... | 1 | YES | 50680 | 5376 | 10156 | 37347 | Honda | Civic | 1996 | 0 |
2 | 680237 | 33 | 23 | 1996-02-06 | B | 500/1000 | 1000 | 1844.00 | 0 | 442490 | ... | 1 | NO | 47829 | 4460 | 9247 | 33644 | Jeep | Wrangler | 2002 | 0 |
3 | 513080 | 42 | 210 | 2008-11-14 | A | 500/1000 | 500 | 1867.29 | 0 | 439408 | ... | 2 | YES | 68862 | 11043 | 5955 | 53548 | Suburu | Legacy | 2003 | 1 |
4 | 192875 | 29 | 81 | 2002-01-08 | A | 100/300 | 1000 | 816.25 | 0 | 640575 | ... | 1 | YES | 59726 | 5617 | 10301 | 41550 | Ford | F150 | 2004 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
695 | 1008425 | 37 | 196 | 1997-06-29 | C | 250/500 | 500 | 1301.20 | 0 | 474615 | ... | 3 | NO | 61433 | 10436 | 11432 | 39745 | Nissan | Pathfinder | 2011 | 1 |
696 | 770702 | 43 | 229 | 2001-05-29 | A | 250/500 | 500 | 1434.94 | 8000000 | 444476 | ... | 1 | ? | 68623 | 6798 | 14557 | 50606 | Volkswagen | Passat | 2013 | 1 |
697 | 755099 | 35 | 209 | 2003-01-11 | C | 100/300 | 500 | 1639.46 | 0 | 639608 | ... | 0 | YES | 58033 | 9129 | 4598 | 40740 | Mercedes | C300 | 2002 | 0 |
698 | 693804 | 44 | 275 | 2003-07-22 | B | 500/1000 | 2000 | 1042.29 | 0 | 432061 | ... | 0 | NO | 35253 | 7359 | 3464 | 24677 | Audi | A3 | 2007 | 1 |
699 | 598086 | 47 | 263 | 1996-08-15 | C | 500/1000 | 500 |
- test = pd.read_csv('./test.csv')
- test
policy_id | age | customer_months | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | bodily_injuries | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 681822 | 60 | 473 | 2002-12-17 | B | 500/1000 | 1000 | 1134.96 | 0 | 445975 | ... | 0 | 3 | ? | 53253 | 5212 | 10251 | 39503 | Saab | 95 | 2006 |
1 | 301288 | 36 | 173 | 1994-01-15 | B | 100/300 | 1000 | 916.20 | 0 | 469238 | ... | 0 | 0 | NO | 69401 | 8309 | 8439 | 50012 | Mercedes | ML350 | 2008 |
2 | 212001 | 36 | 147 | 1995-12-19 | B | 500/1000 | 1000 | 1175.74 | 5000000 | 595953 | ... | 2 | 0 | NO | 63919 | 5572 | 11477 | 42801 | Dodge | Neon | 2009 |
3 | 797680 | 24 | 71 | 1992-06-20 | C | 500/1000 | 500 | 1472.40 | 0 | 613103 | ... | 0 | 0 | NO | 63173 | 12027 | 6500 | 43423 | Dodge | RAM | 2012 |
4 | 789334 | 39 | 230 | 1996-11-28 | C | 250/500 | 1000 | 1159.44 | 4000000 | 581581 | ... | 0 | 0 | ? | 8847 | 904 | 1786 | 6138 | Accura | RSX | 2003 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | 663065 | 36 | 30 | 1999-08-18 | B | 500/1000 | 2000 | 1384.15 | 9000000 | 593323 | ... | 0 | 1 | YES | 4507 | 970 | 477 | 3339 | Dodge | Neon | 2002 |
296 | 283767 | 47 | 285 | 2009-12-23 | C | 250/500 | 500 | 1590.78 | 7000000 | 447235 | ... | 0 | 3 | YES | 45909 | 5599 | 5627 | 34598 | Jeep | Grand Cherokee | 1999 |
297 | 325099 | 39 | 256 | 1999-04-08 | C | 500/1000 | 2000 | 1265.24 | 0 | 592069 | ... | 0 | 0 | ? | 42293 | 5773 | 5491 | 34805 | Dodge | RAM | 1997 |
298 | 465673 | 35 | 54 | 2010-09-08 | C | 100/300 | 500 | 1229.74 | 0 | 451451 | ... | 2 | 0 | ? | 76875 | 14955 | 7312 | 59418 | Nissan | Maxima | 2012 |
299 | 913900 | 34 | 154 | 1990-09-27 | C | 100/300 | 500 | 1744.33 | 0 | 462941 | ... | 1 | 1 | YES | 76269 | 8260 | 8354 | 59141 | Honda | Civic | 1998 |
- data = pd.concat([train, test], axis=0)
- data
policy_id | age | customer_months | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 122576 | 37 | 189 | 2013-08-21 | C | 500/1000 | 1000 | 1465.71 | 5000000 | 455456 | ... | 3 | ? | 54930 | 6029 | 5752 | 44452 | Nissan | Maxima | 2000 | 0.0 |
1 | 937713 | 44 | 234 | 1998-01-04 | B | 250/500 | 500 | 821.24 | 0 | 591805 | ... | 1 | YES | 50680 | 5376 | 10156 | 37347 | Honda | Civic | 1996 | 0.0 |
2 | 680237 | 33 | 23 | 1996-02-06 | B | 500/1000 | 1000 | 1844.00 | 0 | 442490 | ... | 1 | NO | 47829 | 4460 | 9247 | 33644 | Jeep | Wrangler | 2002 | 0.0 |
3 | 513080 | 42 | 210 | 2008-11-14 | A | 500/1000 | 500 | 1867.29 | 0 | 439408 | ... | 2 | YES | 68862 | 11043 | 5955 | 53548 | Suburu | Legacy | 2003 | 1.0 |
4 | 192875 | 29 | 81 | 2002-01-08 | A | 100/300 | 1000 | 816.25 | 0 | 640575 | ... | 1 | YES | 59726 | 5617 | 10301 | 41550 | Ford | F150 | 2004 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
295 | 663065 | 36 | 30 | 1999-08-18 | B | 500/1000 | 2000 | 1384.15 | 9000000 | 593323 | ... | 1 | YES | 4507 | 970 | 477 | 3339 | Dodge | Neon | 2002 | NaN |
296 | 283767 | 47 | 285 | 2009-12-23 | C | 250/500 | 500 | 1590.78 | 7000000 | 447235 | ... | 3 | YES | 45909 | 5599 | 5627 | 34598 | Jeep | Grand Cherokee | 1999 | NaN |
297 | 325099 | 39 | 256 | 1999-04-08 | C | 500/1000 | 2000 | 1265.24 | 0 | 592069 | ... | 0 | ? | 42293 | 5773 | 5491 | 34805 | Dodge | RAM | 1997 | NaN |
298 | 465673 | 35 | 54 | 2010-09-08 | C | 100/300 | 500 | 1229.74 | 0 | 451451 | ... | 0 | ? | 76875 | 14955 | 7312 | 59418 | Nissan | Maxima | 2012 | NaN |
299 | 913900 | 34 | 154 | 1990-09-27 | C | 100/300 | 500 | 1744.33 | 0 | 462941 | ... | 1 | YES | 76269 | 8260 | 8354 | 59141 | Honda | Civic | 1998 | NaN |
- data.index = range(len(data))
- data
policy_id | age | customer_months | policy_bind_date | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | ... | witnesses | police_report_available | total_claim_amount | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 122576 | 37 | 189 | 2013-08-21 | C | 500/1000 | 1000 | 1465.71 | 5000000 | 455456 | ... | 3 | ? | 54930 | 6029 | 5752 | 44452 | Nissan | Maxima | 2000 | 0.0 |
1 | 937713 | 44 | 234 | 1998-01-04 | B | 250/500 | 500 | 821.24 | 0 | 591805 | ... | 1 | YES | 50680 | 5376 | 10156 | 37347 | Honda | Civic | 1996 | 0.0 |
2 | 680237 | 33 | 23 | 1996-02-06 | B | 500/1000 | 1000 | 1844.00 | 0 | 442490 | ... | 1 | NO | 47829 | 4460 | 9247 | 33644 | Jeep | Wrangler | 2002 | 0.0 |
3 | 513080 | 42 | 210 | 2008-11-14 | A | 500/1000 | 500 | 1867.29 | 0 | 439408 | ... | 2 | YES | 68862 | 11043 | 5955 | 53548 | Suburu | Legacy | 2003 | 1.0 |
4 | 192875 | 29 | 81 | 2002-01-08 | A | 100/300 | 1000 | 816.25 | 0 | 640575 | ... | 1 | YES | 59726 | 5617 | 10301 | 41550 | Ford | F150 | 2004 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 663065 | 36 | 30 | 1999-08-18 | B | 500/1000 | 2000 | 1384.15 | 9000000 | 593323 | ... | 1 | YES | 4507 | 970 | 477 | 3339 | Dodge | Neon | 2002 | NaN |
996 | 283767 | 47 | 285 | 2009-12-23 | C | 250/500 | 500 | 1590.78 | 7000000 | 447235 | ... | 3 | YES | 45909 | 5599 | 5627 | 34598 | Jeep | Grand Cherokee | 1999 | NaN |
997 | 325099 | 39 | 256 | 1999-04-08 | C | 500/1000 | 2000 | 1265.24 | 0 | 592069 | ... | 0 | ? | 42293 | 5773 | 5491 | 34805 | Dodge | RAM | 1997 | NaN |
998 | 465673 | 35 | 54 | 2010-09-08 | C | 100/300 | 500 | 1229.74 | 0 | 451451 | ... | 0 | ? | 76875 | 14955 | 7312 | 59418 | Nissan | Maxima | 2012 | NaN |
999 | 913900 | 34 | 154 | 1990-09-27 | C | 100/300 | 500 | 1744.33 | 0 | 462941 | ... | 1 | YES | 76269 | 8260 | 8354 | 59141 | Honda | Civic | 1998 | NaN |
1000 rows × 38 columns
data.isnull().sum()
policy_id 0 age 0 customer_months 0 policy_bind_date 0 policy_state 0 policy_csl 0 policy_deductable 0 policy_annual_premium 0 umbrella_limit 0 insured_zip 0 insured_sex 0 insured_education_level 0 insured_occupation 0 insured_hobbies 0 insured_relationship 0 capital-gains 0 capital-loss 0 incident_date 0 incident_type 0 collision_type 0 incident_severity 0 authorities_contacted 0 incident_state 0 incident_city 0 incident_hour_of_the_day 0...
auto_make 0 auto_model 0 auto_year 0 fraud 300 dtype: int64
- # 唯一值个数
- for col in data.columns:
- print(col, data[col].nunique())
policy_id 1000 age 45 customer_months 385 policy_bind_date 955 policy_state 3 policy_csl 3 policy_deductable 3 policy_annual_premium 996 umbrella_limit 11 insured_zip 999 insured_sex 2 insured_education_level 7 insured_occupation 14 insured_hobbies 20 insured_relationship 6 capital-gains 490 capital-loss 525 incident_date 113 incident_type 4 collision_type 4 incident_severity 4 authorities_contacted 5 incident_state 7 incident_city 7 incident_hour_of_the_day 24
auto_make 14 auto_model 39 auto_year 21 fraud 2
- cat_columns = data.select_dtypes(include='O').columns
- cat_columns
Index(['policy_bind_date', 'policy_state', 'policy_csl', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'incident_date', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'property_damage', 'police_report_available', 'auto_make', 'auto_model'], dtype='object')
- column_name = []
-
- unique_value = []
-
- for col in cat_columns:
-
- column_name.append(col)
-
- unique_value.append(data[col].nunique())
-
-
-
- df = pd.DataFrame()
-
- df['col_name'] = column_name
-
- df['value'] = unique_value
-
- df = df.sort_values('value', ascending=False)
-
- df
col_name | value | |
---|---|---|
0 | policy_bind_date | 955 |
8 | incident_date | 113 |
18 | auto_model | 39 |
6 | insured_hobbies | 20 |
5 | insured_occupation | 14 |
17 | auto_make | 14 |
4 | insured_education_level | 7 |
13 | incident_state | 7 |
14 | incident_city | 7 |
7 | insured_relationship | 6 |
12 | authorities_contacted | 5 |
10 | collision_type | 4 |
11 | incident_severity | 4 |
9 | incident_type | 4 |
2 | policy_csl | 3 |
1 | policy_state | 3 |
15 | property_damage | 3 |
16 | police_report_available | 3 |
3 | insured_sex | 2 |
data[cat_columns]
policy_bind_date | policy_state | policy_csl | insured_sex | insured_education_level | insured_occupation | insured_hobbies | insured_relationship | incident_date | incident_type | collision_type | incident_severity | authorities_contacted | incident_state | incident_city | property_damage | police_report_available | auto_make | auto_model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013-08-21 | C | 500/1000 | FEMALE | Masters | protective-serv | reading | not-in-family | 2014-12-22 | Single Vehicle Collision | Side Collision | Total Loss | Ambulance | S5 | Riverwood | ? | ? | Nissan | Maxima |
1 | 1998-01-04 | B | 250/500 | MALE | JD | craft-repair | polo | other-relative | 2015-02-18 | Multi-vehicle Collision | Side Collision | Minor Damage | Other | S5 | Springfield | ? | YES | Honda | Civic |
2 | 1996-02-06 | B | 500/1000 | FEMALE | High School | machine-op-inspct | skydiving | wife | 2015-01-18 | Single Vehicle Collision | Side Collision | Total Loss | Police | S3 | Northbend | ? | NO | Jeep | Wrangler |
3 | 2008-11-14 | A | 500/1000 | MALE | JD | transport-moving | video-games | own-child | 2015-02-02 | Multi-vehicle Collision | Front Collision | Major Damage | Fire | S3 | Northbend | YES | YES | Suburu | Legacy |
4 | 2002-01-08 | A | 100/300 | FEMALE | MD | craft-repair | video-games | own-child | 2015-02-09 | Multi-vehicle Collision | Rear Collision | Total Loss | Fire | S2 | Northbend | YES | YES | Ford | F150 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 1999-08-18 | B | 500/1000 | FEMALE | College | prof-specialty | kayaking | not-in-family | 2015-01-14 | Parked Car | ? | Trivial Damage | None | S3 | Arlington | NO | YES | Dodge | Neon |
996 | 2009-12-23 | C | 250/500 | MALE | MD | adm-clerical | movies | unmarried | 2015-02-09 | Multi-vehicle Collision | Side Collision | Total Loss | Fire | S3 | Columbus | ? | YES | Jeep | Grand Cherokee |
997 | 1999-04-08 | C | 500/1000 | FEMALE | Associate | other-service | hiking | not-in-family | 2014-12-21 | Single Vehicle Collision | Rear Collision | Minor Damage | Police | S1 | Hillsdale | YES | ? | Dodge | RAM |
998 | 2010-09-08 | C | 100/300 | FEMALE | MD | protective-serv | hiking | unmarried | 2015-01-27 | Multi-vehicle Collision | Side Collision | Minor Damage | Fire | S4 | Springfield | YES | ? | Nissan | Maxima |
999 | 1990-09-27 | C | 100/300 | MALE | Masters | protective-serv | dancing | other-relative | 2015-01-29 | Single Vehicle Collision | Front Collision | Minor Damage | Ambulance | S7 | Hillsdale | NO | YES | Honda | Civic |
1000 rows × 19 columns
- # 单独看某个字段
- data['property_damage'].value_counts()
- data['property_damage'] = data['property_damage'].map({'NO': 0, 'YES': 1, '?': 2})
- data['property_damage'].value_counts()
2 360
0 338
1 302
Name: property_damage, dtype: int64
- data['police_report_available'].value_counts()
- data['police_report_available'] = data['police_report_available'].map({'NO': 0, 'YES': 1, '?': 2})
- data['police_report_available'].value_counts()
2 343
0 343
1 314
Name: police_report_available, dtype: int64
- # policy_bind_date, incident_date
- data['policy_bind_date'] = pd.to_datetime(data['policy_bind_date'])
- data['incident_date'] = pd.to_datetime(data['incident_date'])
-
- # 查看最大日期,最小日期
- data['policy_bind_date'].min() # 1990-01-08
- data['policy_bind_date'].max() # 2015-02-22
-
- data['incident_date'].min() # 2015-01-01
- data['incident_date'].max() # 2015-03-01
Timestamp('2015-03-29 00:00:00')
- base_date = data['policy_bind_date'].min()
- # 转换为date_diff
- data['policy_bind_date_diff'] = (data['policy_bind_date'] - base_date).dt.days
- data['incident_date_diff'] = (data['incident_date'] - base_date).dt.days
- data['incident_date_policy_bind_date_diff'] = data['incident_date_diff'] - data['policy_bind_date_diff']
- data[['policy_bind_date', 'incident_date', 'policy_bind_date_diff', 'incident_date_diff', 'incident_date_policy_bind_date_diff']]
policy_bind_date | incident_date | policy_bind_date_diff | incident_date_diff | incident_date_policy_bind_date_diff | |
---|---|---|---|---|---|
0 | 2013-08-21 | 2014-12-22 | 8640 | 9128 | 488 |
1 | 1998-01-04 | 2015-02-18 | 2932 | 9186 | 6254 |
2 | 1996-02-06 | 2015-01-18 | 2234 | 9155 | 6921 |
3 | 2008-11-14 | 2015-02-02 | 6899 | 9170 | 2271 |
4 | 2002-01-08 | 2015-02-09 | 4397 | 9177 | 4780 |
... | ... | ... | ... | ... | ... |
995 | 1999-08-18 | 2015-01-14 | 3523 | 9151 | 5628 |
996 | 2009-12-23 | 2015-02-09 | 7303 | 9177 | 1874 |
997 | 1999-04-08 | 2014-12-21 | 3391 | 9127 | 5736 |
998 | 2010-09-08 | 2015-01-27 | 7562 | 9164 | 1602 |
999 | 1990-09-27 | 2015-01-29 | 276 | 9166 | 8890 |
1000 rows × 5 columns
- data.drop(['policy_bind_date', 'incident_date'], axis=1, inplace=True)
- data
policy_id | age | customer_months | policy_state | policy_csl | policy_deductable | policy_annual_premium | umbrella_limit | insured_zip | insured_sex | ... | injury_claim | property_claim | vehicle_claim | auto_make | auto_model | auto_year | fraud | policy_bind_date_diff | incident_date_diff | incident_date_policy_bind_date_diff | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 122576 | 37 | 189 | C | 500/1000 | 1000 | 1465.71 | 5000000 | 455456 | FEMALE | ... | 6029 | 5752 | 44452 | Nissan | Maxima | 2000 | 0.0 | 8640 | 9128 | 488 |
1 | 937713 | 44 | 234 | B | 250/500 | 500 | 821.24 | 0 | 591805 | MALE | ... | 5376 | 10156 | 37347 | Honda | Civic | 1996 | 0.0 | 2932 | 9186 | 6254 |
2 | 680237 | 33 | 23 | B | 500/1000 | 1000 | 1844.00 | 0 | 442490 | FEMALE | ... | 4460 | 9247 | 33644 | Jeep | Wrangler | 2002 | 0.0 | 2234 | 9155 | 6921 |
3 | 513080 | 42 | 210 | A | 500/1000 | 500 | 1867.29 | 0 | 439408 | MALE | ... | 11043 | 5955 | 53548 | Suburu | Legacy | 2003 | 1.0 | 6899 | 9170 | 2271 |
4 | 192875 | 29 | 81 | A | 100/300 | 1000 | 816.25 | 0 | 640575 | FEMALE | ... | 5617 | 10301 | 41550 | Ford | F150 | 2004 | 0.0 | 4397 | 9177 | 4780 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 663065 | 36 | 30 | B | 500/1000 | 2000 | 1384.15 | 9000000 | 593323 | FEMALE | ... | 970 | 477 | 3339 | Dodge | Neon | 2002 | NaN | 3523 | 9151 | 5628 |
996 | 283767 | 47 | 285 | C | 250/500 | 500 | 1590.78 | 7000000 | 447235 | MALE | ... | 5599 | 5627 | 34598 | Jeep | Grand Cherokee | 1999 | NaN | 7303 | 9177 | 1874 |
997 | 325099 | 39 | 256 | C | 500/1000 | 2000 | 1265.24 | 0 | 592069 | FEMALE | ... | 5773 | 5491 | 34805 | Dodge | RAM | 1997 | NaN | 3391 | 9127 | 5736 |
998 | 465673 | 35 | 54 | C | 100/300 | 500 | 1229.74 | 0 | 451451 | FEMALE | ... | 14955 | 7312 | 59418 | Nissan | Maxima | 2012 | NaN | 7562 | 9164 | 1602 |
999 | 913900 | 34 | 154 | C | 100/300 | 500 | 1744.33 | 0 | 462941 | MALE | ... | 8260 | 8354 | 59141 | Honda | Civic | 1998 | NaN | 276 | 9166 | 8890 |
1000 rows × 39 columns
- data.drop(['policy_id'], axis=1, inplace=True)
- data.columns
Index(['age', 'customer_months', 'policy_state', 'policy_csl', 'policy_deductable', 'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'capital-gains', 'capital-loss', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'incident_hour_of_the_day', 'number_of_vehicles_involved', 'property_damage', 'bodily_injuries', 'witnesses', 'police_report_available', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make', 'auto_model', 'auto_year', 'fraud', 'policy_bind_date_diff', 'incident_date_diff', 'incident_date_policy_bind_date_diff'], dtype='object')
- ## 标签编码
- from sklearn.preprocessing import LabelEncoder
- cat_columns = data.select_dtypes(include='O').columns
- for col in cat_columns:
- le = LabelEncoder()
- data[col] = le.fit_transform(data[col])
- data[cat_columns]
policy_state | policy_csl | insured_sex | insured_education_level | insured_occupation | insured_hobbies | insured_relationship | incident_type | collision_type | incident_severity | authorities_contacted | incident_state | incident_city | auto_make | auto_model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2 | 0 | 5 | 10 | 15 | 1 | 2 | 3 | 2 | 0 | 4 | 5 | 9 | 26 |
1 | 1 | 1 | 1 | 3 | 2 | 14 | 2 | 0 | 3 | 1 | 3 | 4 | 6 | 6 | 10 |
2 | 1 | 2 | 0 | 2 | 6 | 16 | 5 | 2 | 3 | 2 | 4 | 2 | 3 | 7 | 36 |
3 | 0 | 2 | 1 | 3 | 13 | 18 | 3 | 0 | 1 | 0 | 1 | 2 | 3 | 11 | 21 |
4 | 0 | 0 | 0 | 4 | 2 | 18 | 3 | 0 | 2 | 2 | 1 | 1 | 3 | 5 | 14 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
995 | 1 | 2 | 0 | 1 | 9 | 11 | 1 | 1 | 0 | 3 | 2 | 2 | 0 | 4 | 27 |
996 | 2 | 1 | 1 | 4 | 0 | 12 | 4 | 0 | 3 | 2 | 1 | 2 | 1 | 7 | 17 |
997 | 2 | 2 | 0 | 0 | 7 | 10 | 1 | 2 | 2 | 1 | 4 | 0 | 2 | 4 | 30 |
998 | 2 | 0 | 0 | 4 | 10 | 10 | 4 | 0 | 3 | 1 | 1 | 3 | 6 | 9 | 26 |
999 | 2 | 0 | 1 | 5 | 10 | 7 | 2 | 2 | 1 | 1 | 0 | 6 | 2 | 6 | 10 |
1000 rows × 15 columns
- train = data[data['fraud'].notnull()]
- test = data[data['fraud'].isnull()]
- import lightgbm as lgb
- model_lgb = lgb.LGBMClassifier(
- num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25, objective='binary',
- max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2022,
- n_estimators=2000, subsample=1, colsample_bytree=1,
- )
- # 模型训练
- model_lgb.fit(train.drop(['fraud'], axis=1), train['fraud'])
- # AUC评测: 以proba进行提交,结果会更好
- y_pred = model_lgb.predict_proba(test.drop(['fraud'], axis=1))
- y_pred
array([[9.45512572e-01, 5.44874283e-02], [2.82473773e-01, 7.17526227e-01], [9.93965667e-01, 6.03433310e-03], [9.76151564e-01, 2.38484363e-02], [9.80256195e-01, 1.97438052e-02], [9.14403023e-01, 8.55969772e-02], [9.92182848e-01, 7.81715179e-03], [9.95752551e-01, 4.24744905e-03], [9.97857734e-01, 2.14226648e-03], [9.87760328e-01, 1.22396721e-02], [9.82583675e-01, 1.74163246e-02], [9.94944970e-01, 5.05503008e-03], [2.40580578e-01, 7.59419422e-01], [9.94618068e-01, 5.38193197e-03], [9.93408223e-01, 6.59177664e-03], [6.73835883e-01, 3.26164117e-01], [9.77552932e-01, 2.24470682e-02], [9.90247497e-01, 9.75250283e-03], [4.04842173e-02, 9.59515783e-01], [9.90730834e-01, 9.26916620e-03], [9.88710210e-01, 1.12897903e-02], [9.40698501e-01, 5.93014990e-02], [1.80777338e-01, 8.19222662e-01], [9.69550179e-01, 3.04498213e-02], [9.88685801e-01, 1.13141994e-02],
[9.94965319e-01, 5.03468063e-03], [9.94962679e-01, 5.03732092e-03], [9.96830618e-01, 3.16938159e-03], [9.90193503e-01, 9.80649715e-03], [9.91544053e-01, 8.45594716e-03]])
train['fraud'].mean()
0.25857142857142856
y_pred[:, 1]
array([5.44874283e-02, 7.17526227e-01, 6.03433310e-03, 2.38484363e-02, 1.97438052e-02, 8.55969772e-02, 7.81715179e-03, 4.24744905e-03, 2.14226648e-03, 1.22396721e-02, 1.74163246e-02, 5.05503008e-03, 7.59419422e-01, 5.38193197e-03, 6.59177664e-03, 3.26164117e-01, 2.24470682e-02, 9.75250283e-03, 9.59515783e-01, 9.26916620e-03, 1.12897903e-02, 5.93014990e-02, 8.19222662e-01, 3.04498213e-02, 1.13141994e-02, 7.87110802e-01, 3.51231514e-03, 3.11553853e-03, 5.00394301e-01, 3.87089331e-03, 5.79284039e-01, 1.03762342e-02, 9.74583399e-01, 1.50181609e-02, 1.26499281e-02, 1.66637544e-02, 8.67847939e-01, 3.41162415e-03, 8.04672111e-03, 2.10217393e-03, 9.10874722e-01, 5.99570004e-03, 2.42292836e-03, 3.68146008e-03, 9.37376096e-02, 6.16000904e-03, 1.61912989e-02, 7.34289768e-03, 9.09509256e-01, 5.45276866e-03, 6.32876633e-03, 5.47020132e-03, 4.88930674e-03, 5.57701967e-01, 3.34273498e-01, 5.34055926e-03, 1.25786465e-02, 8.42825670e-01, 2.30038169e-02, 5.59174107e-01, 6.38115178e-03, 1.22456277e-02, 2.25640586e-02, 8.64430205e-01, 6.02454975e-03, 1.80564536e-02, 8.72487291e-01, 9.59318359e-01, 8.09359944e-03, 9.75164716e-03, 7.31212517e-03, 2.44207611e-02, 6.63915914e-01, 7.02252967e-01, 1.08551058e-03, 2.71298989e-03, 1.26591748e-02, 7.70278791e-04, 8.35599276e-01, 8.27508089e-03, 1.27339206e-02, 2.43397773e-02, 2.56317959e-01, 4.77122460e-02, 7.70369021e-03, 2.48156544e-03, 1.51776195e-02, 3.29396922e-03, 1.24885745e-02, 7.51917783e-01, 4.29885247e-03, 8.24166104e-01, 9.22399316e-01, 7.32952970e-03, 2.87178506e-03, 2.00915417e-02, 1.29787033e-02, 1.06394187e-02, 2.23851715e-01, 9.45179658e-03,
1.64958505e-02, 6.79121674e-02, 1.30053897e-02, 7.09693541e-01, 5.58447882e-01, 1.05058832e-03, 7.19235810e-01, 4.89357407e-02, 1.95654359e-01, 9.78073382e-01, 4.08296287e-02, 9.41766778e-01, 4.98860263e-01, 4.33861786e-03, 3.91737138e-01, 5.03468063e-03, 5.03732092e-03, 3.16938159e-03, 9.80649715e-03, 8.45594716e-03])
- result = pd.read_csv('submission.csv')
- result['fraud'] = y_pred[:, 1]
- result.to_csv('baseline.csv', index=False)
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。