
Insurance Anti-Fraud Prediction: Study Notes

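Preface

Insurance is an important part of the financial system and plays a major role in social development and in safeguarding people's livelihoods. Insurance fraud has become increasingly common in recent years; for some lines of insurance, fraudulent claims already account for 20% or more of the total amount paid out. Detecting insurance fraud has therefore become a key application scenario in the insurance industry.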

Dataset

The dataset is available on Alibaba Cloud Tianchi (阿里云天池).

I. Data Loading

    import pandas as pd
    # Load the data
    train = pd.read_csv('train.csv')
    train
[Output: preview of train, rows 0 to 699 (middle rows elided). Columns shown: policy_id, age, customer_months, policy_bind_date, policy_state, policy_csl, policy_deductable, policy_annual_premium, umbrella_limit, insured_zip, ..., witnesses, police_report_available, total_claim_amount, injury_claim, property_claim, vehicle_claim, auto_make, auto_model, auto_year, fraud. The fraud label is 0 or 1, and some police_report_available values are the placeholder '?'.]
    test = pd.read_csv('./test.csv')
    test
[Output: preview of test, rows 0 to 299 (middle rows elided). Columns shown: policy_id, age, customer_months, policy_bind_date, policy_state, policy_csl, policy_deductable, policy_annual_premium, umbrella_limit, insured_zip, ..., bodily_injuries, witnesses, police_report_available, total_claim_amount, injury_claim, property_claim, vehicle_claim, auto_make, auto_model, auto_year. There is no fraud column.]

Merge train and test

    data = pd.concat([train, test], axis=0)
    data
[Output: preview of data after concatenation. Same columns as train. The training rows come first with fraud shown as 0.0 or 1.0; the test rows follow, still carrying their original index values (295 to 299 in the preview) and with fraud set to NaN.]
    data.index = range(len(data))
    data
[Output: preview of data after the index reset. Same columns as before; the rows are now numbered 0 to 999, and the last rows shown (995 to 999) are the test rows with fraud = NaN.]

1000 rows × 38 columns
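The concatenation and the manual index reset can also be done in one call. The following is a minimal sketch on toy frames, not the article's data: `train_demo`, `test_demo`, and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train_demo = pd.DataFrame({"age": [37, 44], "fraud": [0, 1]})
test_demo = pd.DataFrame({"age": [36, 24]})  # no label column

# ignore_index=True renumbers the rows 0..n-1, replacing the manual index reset
data_demo = pd.concat([train_demo, test_demo], axis=0, ignore_index=True)

# Rows from the test frame have no label, so fraud is NaN there and the
# column is upcast from int to float (0 -> 0.0, 1 -> 1.0)
print(data_demo)
```

This also explains why the fraud column shows 0.0 and 1.0 after merging: an integer column cannot hold NaN, so pandas converts it to float.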

II. Data Exploration

    data.isnull().sum()

    policy_id                   0
    age                         0
    customer_months             0
    policy_bind_date            0
    policy_state                0
    policy_csl                  0
    policy_deductable           0
    policy_annual_premium       0
    umbrella_limit              0
    insured_zip                 0
    insured_sex                 0
    insured_education_level     0
    insured_occupation          0
    insured_hobbies             0
    insured_relationship        0
    capital-gains               0
    capital-loss                0
    incident_date               0
    incident_type               0
    collision_type              0
    incident_severity           0
    authorities_contacted       0
    incident_state              0
    incident_city               0
    incident_hour_of_the_day    0
    ...
    auto_make                   0
    auto_model                  0
    auto_year                   0
    fraud                     300
    dtype: int64
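The null count above reports missing values only for fraud (the 300 unlabelled test rows), yet the previews show '?' in fields such as police_report_available. Because '?' is an ordinary string, `isnull()` does not count it. A minimal sketch of counting these placeholders, using a made-up `demo` frame rather than the real data:

```python
import pandas as pd

# Hypothetical sample mimicking the '?' placeholders seen in the previews
demo = pd.DataFrame({
    "property_damage": ["?", "YES", "NO", "?"],
    "police_report_available": ["?", "YES", "NO", "YES"],
    "auto_make": ["Nissan", "Honda", "Jeep", "Dodge"],
})

# isnull() finds nothing, because "?" is a regular string value
null_counts = demo.isnull().sum()

# Count the "?" placeholders per column instead
question_counts = (demo == "?").sum()
print(question_counts)
```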

Number of unique values

    # Number of unique values per column
    for col in data.columns:
        print(col, data[col].nunique())

    policy_id 1000
    age 45
    customer_months 385
    policy_bind_date 955
    policy_state 3
    policy_csl 3
    policy_deductable 3
    policy_annual_premium 996
    umbrella_limit 11
    insured_zip 999
    insured_sex 2
    insured_education_level 7
    insured_occupation 14
    insured_hobbies 20
    insured_relationship 6
    capital-gains 490
    capital-loss 525
    incident_date 113
    incident_type 4
    collision_type 4
    incident_severity 4
    authorities_contacted 5
    incident_state 7
    incident_city 7
    incident_hour_of_the_day 24
    ...
    auto_make 14
    auto_model 39
    auto_year 21
    fraud 2

    cat_columns = data.select_dtypes(include='O').columns
    cat_columns

Index(['policy_bind_date', 'policy_state', 'policy_csl', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'incident_date', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'property_damage', 'police_report_available', 'auto_make', 'auto_model'], dtype='object')

    column_name = []
    unique_value = []
    for col in cat_columns:
        column_name.append(col)
        unique_value.append(data[col].nunique())
    df = pd.DataFrame()
    df['col_name'] = column_name
    df['value'] = unique_value
    df = df.sort_values('value', ascending=False)
    df
                       col_name  value
    0          policy_bind_date    955
    8             incident_date    113
    18               auto_model     39
    6           insured_hobbies     20
    5        insured_occupation     14
    17                auto_make     14
    4   insured_education_level      7
    13           incident_state      7
    14            incident_city      7
    7      insured_relationship      6
    12    authorities_contacted      5
    10           collision_type      4
    11        incident_severity      4
    9             incident_type      4
    2                policy_csl      3
    1              policy_state      3
    15          property_damage      3
    16  police_report_available      3
    3               insured_sex      2
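The loop that builds this table can be condensed, since `DataFrame.nunique()` already returns one count per column. A minimal sketch on a made-up `demo` frame; it selects non-numeric columns with `exclude="number"`, which is an assumption chosen here for portability rather than the `include='O'` filter used above:

```python
import pandas as pd

# Hypothetical sample with two string columns and one numeric column
demo = pd.DataFrame({
    "policy_state": ["A", "B", "C", "A"],
    "insured_sex": ["MALE", "FEMALE", "MALE", "MALE"],
    "auto_year": [2000, 1996, 2002, 2003],
})

# Keep only the non-numeric (categorical) columns
cat_cols_demo = demo.select_dtypes(exclude="number").columns

# One unique-value count per column, sorted from most to fewest
unique_counts = demo[cat_cols_demo].nunique().sort_values(ascending=False)
print(unique_counts)
```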
    data[cat_columns]

[Output: preview of the 19 categorical columns (policy_bind_date, policy_state, policy_csl, insured_sex, insured_education_level, insured_occupation, insured_hobbies, insured_relationship, incident_date, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, incident_city, property_damage, police_report_available, auto_make, auto_model), rows 0 to 4 and 995 to 999. First row: 2013-08-21, C, 500/1000, FEMALE, Masters, protective-serv, reading, not-in-family, 2014-12-22, Single Vehicle Collision, Side Collision, Total Loss, Ambulance, S5, Riverwood, ?, ?, Nissan, Maxima. The placeholder '?' appears in collision_type, property_damage, and police_report_available.]

1000 rows × 19 columns

Inspect individual fields

    # Inspect a single field
    data['property_damage'].value_counts()
    data['property_damage'] = data['property_damage'].map({'NO': 0, 'YES': 1, '?': 2})
    data['property_damage'].value_counts()

    2    360
    0    338
    1    302
    Name: property_damage, dtype: int64

    data['police_report_available'].value_counts()
    data['police_report_available'] = data['police_report_available'].map({'NO': 0, 'YES': 1, '?': 2})
    data['police_report_available'].value_counts()

    2    343
    0    343
    1    314
    Name: police_report_available, dtype: int64
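Both fields use the same three values, so one shared dictionary can encode them. `Series.map()` silently turns any value missing from the dictionary into NaN, so a check after mapping is a cheap safeguard. A minimal sketch on a made-up `demo` frame; `yes_no_map` is a name introduced here for illustration:

```python
import pandas as pd

# Hypothetical sample of the two YES/NO/? fields
demo = pd.DataFrame({
    "property_damage": ["?", "YES", "NO"],
    "police_report_available": ["NO", "?", "YES"],
})

# One shared mapping; "?" keeps its own code rather than being imputed
yes_no_map = {"NO": 0, "YES": 1, "?": 2}

for col in ["property_damage", "police_report_available"]:
    mapped = demo[col].map(yes_no_map)
    # Values absent from the dict become NaN, so fail loudly if any appear
    assert mapped.notnull().all(), f"unmapped values in {col}"
    demo[col] = mapped.astype(int)

print(demo)
```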

Check the maximum and minimum dates

    # policy_bind_date, incident_date
    data['policy_bind_date'] = pd.to_datetime(data['policy_bind_date'])
    data['incident_date'] = pd.to_datetime(data['incident_date'])
    # Check the maximum and minimum dates
    # (a notebook cell echoes only the value of the last expression)
    data['policy_bind_date'].min()
    data['policy_bind_date'].max()
    data['incident_date'].min()
    data['incident_date'].max()

Timestamp('2015-03-29 00:00:00')

    base_date = data['policy_bind_date'].min()
    # Convert each date to the number of days since base_date
    data['policy_bind_date_diff'] = (data['policy_bind_date'] - base_date).dt.days
    data['incident_date_diff'] = (data['incident_date'] - base_date).dt.days
    # Days between binding the policy and the incident
    data['incident_date_policy_bind_date_diff'] = data['incident_date_diff'] - data['policy_bind_date_diff']
    data[['policy_bind_date', 'incident_date', 'policy_bind_date_diff', 'incident_date_diff', 'incident_date_policy_bind_date_diff']]
        policy_bind_date incident_date  policy_bind_date_diff  incident_date_diff  incident_date_policy_bind_date_diff
    0         2013-08-21    2014-12-22                   8640                9128                                  488
    1         1998-01-04    2015-02-18                   2932                9186                                 6254
    2         1996-02-06    2015-01-18                   2234                9155                                 6921
    3         2008-11-14    2015-02-02                   6899                9170                                 2271
    4         2002-01-08    2015-02-09                   4397                9177                                 4780
    ..               ...           ...                    ...                 ...                                  ...
    995       1999-08-18    2015-01-14                   3523                9151                                 5628
    996       2009-12-23    2015-02-09                   7303                9177                                 1874
    997       1999-04-08    2014-12-21                   3391                9127                                 5736
    998       2010-09-08    2015-01-27                   7562                9164                                 1602
    999       1990-09-27    2015-01-29                    276                9166                                 8890

1000 rows × 5 columns
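The third feature is the difference of two offsets from the same base date, so the base date cancels out and the value can be computed directly from the two date columns. A minimal sketch using the first two rows of the table above; the column name `days_to_incident` is hypothetical:

```python
import pandas as pd

# First two rows of the table above
demo = pd.DataFrame({
    "policy_bind_date": ["2013-08-21", "1998-01-04"],
    "incident_date": ["2014-12-22", "2015-02-18"],
})
demo["policy_bind_date"] = pd.to_datetime(demo["policy_bind_date"])
demo["incident_date"] = pd.to_datetime(demo["incident_date"])

# Days between binding the policy and the incident, without a base date
demo["days_to_incident"] = (demo["incident_date"] - demo["policy_bind_date"]).dt.days
print(demo["days_to_incident"].tolist())  # [488, 6254]
```

The result matches the incident_date_policy_bind_date_diff column for rows 0 and 1.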

Drop the original date fields policy_bind_date and incident_date

    data.drop(['policy_bind_date', 'incident_date'], axis=1, inplace=True)
    data
[Output: preview of data, rows 0 to 4 and 995 to 999. Columns shown: policy_id, age, customer_months, policy_state, policy_csl, policy_deductable, policy_annual_premium, umbrella_limit, insured_zip, insured_sex, ..., injury_claim, property_claim, vehicle_claim, auto_make, auto_model, auto_year, fraud, policy_bind_date_diff, incident_date_diff, incident_date_policy_bind_date_diff. The two original date columns are gone and the three day-count features appear at the end.]

1000 rows × 39 columns

    data.drop(['policy_id'], axis=1, inplace=True)
    data.columns

Index(['age', 'customer_months', 'policy_state', 'policy_csl', 'policy_deductable', 'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'capital-gains', 'capital-loss', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'incident_hour_of_the_day', 'number_of_vehicles_involved', 'property_damage', 'bodily_injuries', 'witnesses', 'police_report_available', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make', 'auto_model', 'auto_year', 'fraud', 'policy_bind_date_diff', 'incident_date_diff', 'incident_date_policy_bind_date_diff'], dtype='object')

    # Label encoding: replace each category string with an integer code
    from sklearn.preprocessing import LabelEncoder
    cat_columns = data.select_dtypes(include='O').columns
    for col in cat_columns:
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])
    data[cat_columns]
[Output: preview of data[cat_columns] after label encoding, rows 0 to 4 and 995 to 999. All 15 remaining categorical columns (policy_state, policy_csl, insured_sex, insured_education_level, insured_occupation, insured_hobbies, insured_relationship, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, incident_city, auto_make, auto_model) now hold integer codes.]

1000 rows × 15 columns
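To make the encoding concrete: `LabelEncoder` sorts the distinct values and numbers them from 0. A minimal sketch on a made-up list of policy_state-like labels:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["B", "A", "C", "A"])

# Classes are sorted, then numbered: A -> 0, B -> 1, C -> 2
print(le.classes_.tolist())  # ['A', 'B', 'C']
print(codes.tolist())        # [1, 0, 2, 0]

# The fitted encoder can map codes back to the original labels
print(le.inverse_transform([2, 0]).tolist())  # ['C', 'A']
```

The integer order is alphabetical and carries no real meaning. Tree-based models usually tolerate this, but a linear model would read it as an ordering, in which case one-hot encoding is the more common choice.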

III. Splitting the Dataset

    # Rows with a label are the training set; rows without one are the test set
    train = data[data['fraud'].notnull()]
    test = data[data['fraud'].isnull()]
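The split relies on the fact that only test rows have a missing label. A minimal sketch of the same idea on a made-up three-row frame, with two optional refinements that are assumptions of this example and not part of the original code: `.copy()` to get independent frames, and casting the label back to int.

```python
import numpy as np
import pandas as pd

# Hypothetical merged frame: two labelled rows, one unlabelled row
data_demo = pd.DataFrame({"age": [37, 44, 36], "fraud": [0.0, 1.0, np.nan]})

# Labelled rows form the training set; .copy() gives an independent frame
train_demo = data_demo[data_demo["fraud"].notnull()].copy()

# Unlabelled rows form the test set; the all-NaN label column is dropped
test_demo = data_demo[data_demo["fraud"].isnull()].drop(columns=["fraud"])

# Features and integer labels for model fitting
X_train = train_demo.drop(columns=["fraud"])
y_train = train_demo["fraud"].astype(int)
print(X_train.shape, y_train.tolist(), test_demo.shape)
```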

IV. Model Training

    import lightgbm as lgb
    model_lgb = lgb.LGBMClassifier(
        num_leaves=2**5-1, reg_alpha=0.25, reg_lambda=0.25, objective='binary',
        max_depth=-1, learning_rate=0.005, min_child_samples=3, random_state=2022,
        n_estimators=2000, subsample=1, colsample_bytree=1,
    )
    # Train the model
    model_lgb.fit(train.drop(['fraud'], axis=1), train['fraud'])
    # The competition is scored by AUC: submitting probabilities gives a better result
    y_pred = model_lgb.predict_proba(test.drop(['fraud'], axis=1))
    y_pred

    array([[9.45512572e-01, 5.44874283e-02],
           [2.82473773e-01, 7.17526227e-01],
           [9.93965667e-01, 6.03433310e-03],
           [9.76151564e-01, 2.38484363e-02],
           [9.80256195e-01, 1.97438052e-02],
           [9.14403023e-01, 8.55969772e-02],
           [9.92182848e-01, 7.81715179e-03],
           [9.95752551e-01, 4.24744905e-03],
           [9.97857734e-01, 2.14226648e-03],
           [9.87760328e-01, 1.22396721e-02],
           [9.82583675e-01, 1.74163246e-02],
           [9.94944970e-01, 5.05503008e-03],
           [2.40580578e-01, 7.59419422e-01],
           [9.94618068e-01, 5.38193197e-03],
           [9.93408223e-01, 6.59177664e-03],
           [6.73835883e-01, 3.26164117e-01],
           [9.77552932e-01, 2.24470682e-02],
           [9.90247497e-01, 9.75250283e-03],
           [4.04842173e-02, 9.59515783e-01],
           [9.90730834e-01, 9.26916620e-03],
           [9.88710210e-01, 1.12897903e-02],
           [9.40698501e-01, 5.93014990e-02],
           [1.80777338e-01, 8.19222662e-01],
           [9.69550179e-01, 3.04498213e-02],
           [9.88685801e-01, 1.13141994e-02],
           ...,
           [9.94965319e-01, 5.03468063e-03],
           [9.94962679e-01, 5.03732092e-03],
           [9.96830618e-01, 3.16938159e-03],
           [9.90193503e-01, 9.80649715e-03],
           [9.91544053e-01, 8.45594716e-03]])

Evaluation score

    # Share of fraud cases among the labelled training rows
    train['fraud'].mean()

0.25857142857142856

y_pred[:, 1]

array([5.44874283e-02, 7.17526227e-01, 6.03433310e-03, 2.38484363e-02, 1.97438052e-02, 8.55969772e-02, 7.81715179e-03, 4.24744905e-03, 2.14226648e-03, 1.22396721e-02, 1.74163246e-02, 5.05503008e-03, 7.59419422e-01, 5.38193197e-03, 6.59177664e-03, 3.26164117e-01, 2.24470682e-02, 9.75250283e-03, 9.59515783e-01, 9.26916620e-03, 1.12897903e-02, 5.93014990e-02, 8.19222662e-01, 3.04498213e-02, 1.13141994e-02, 7.87110802e-01, 3.51231514e-03, 3.11553853e-03, 5.00394301e-01, 3.87089331e-03, 5.79284039e-01, 1.03762342e-02, 9.74583399e-01, 1.50181609e-02, 1.26499281e-02, 1.66637544e-02, 8.67847939e-01, 3.41162415e-03, 8.04672111e-03, 2.10217393e-03, 9.10874722e-01, 5.99570004e-03, 2.42292836e-03, 3.68146008e-03, 9.37376096e-02, 6.16000904e-03, 1.61912989e-02, 7.34289768e-03, 9.09509256e-01, 5.45276866e-03, 6.32876633e-03, 5.47020132e-03, 4.88930674e-03, 5.57701967e-01, 3.34273498e-01, 5.34055926e-03, 1.25786465e-02, 8.42825670e-01, 2.30038169e-02, 5.59174107e-01, 6.38115178e-03, 1.22456277e-02, 2.25640586e-02, 8.64430205e-01, 6.02454975e-03, 1.80564536e-02, 8.72487291e-01, 9.59318359e-01, 8.09359944e-03, 9.75164716e-03, 7.31212517e-03, 2.44207611e-02, 6.63915914e-01, 7.02252967e-01, 1.08551058e-03, 2.71298989e-03, 1.26591748e-02, 7.70278791e-04, 8.35599276e-01, 8.27508089e-03, 1.27339206e-02, 2.43397773e-02, 2.56317959e-01, 4.77122460e-02, 7.70369021e-03, 2.48156544e-03, 1.51776195e-02, 3.29396922e-03, 1.24885745e-02, 7.51917783e-01, 4.29885247e-03, 8.24166104e-01, 9.22399316e-01, 7.32952970e-03, 2.87178506e-03, 2.00915417e-02, 1.29787033e-02, 1.06394187e-02, 2.23851715e-01, 9.45179658e-03,

1.64958505e-02, 6.79121674e-02, 1.30053897e-02, 7.09693541e-01, 5.58447882e-01, 1.05058832e-03, 7.19235810e-01, 4.89357407e-02, 1.95654359e-01, 9.78073382e-01, 4.08296287e-02, 9.41766778e-01, 4.98860263e-01, 4.33861786e-03, 3.91737138e-01, 5.03468063e-03, 5.03732092e-03, 3.16938159e-03, 9.80649715e-03, 8.45594716e-03])
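The second column is submitted because AUC measures how well the scores rank fraud cases above non-fraud cases, and thresholding the probabilities into 0/1 labels throws away most of that ranking. A minimal sketch with made-up numbers:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
proba = [0.1, 0.3, 0.2, 0.4]             # hypothetical fraud probabilities
labels = [int(p >= 0.5) for p in proba]  # hard labels at 0.5: all zeros here

# 3 of the 4 (fraud, non-fraud) pairs are ranked correctly
auc_proba = roc_auc_score(y_true, proba)
# Every pair is tied once the scores are collapsed to labels
auc_labels = roc_auc_score(y_true, labels)

print(auc_proba, auc_labels)  # 0.75 0.5
```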

    result = pd.read_csv('submission.csv')
    result['fraud'] = y_pred[:, 1]
    result.to_csv('baseline.csv', index=False)
