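When an applicant applies for a loan on a (P2P) platform, the platform has the customer fill in a loan application form, online or offline, to collect basic information, and it also draws on third-party sources such as credit bureaus. These attributes are used to build a logistic regression model. With the model's predictions the platform can judge whether a loan application is likely to default, and so decide whether to issue the loan to the applicant.
The algorithm needs to build a model from historical data to make this prediction.
The dataset is loan business data from the Lending Club platform, with 52 variables and 39,522 records.
(1) Data preprocessing
1. Inspect the data overall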
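To make the idea of a logistic regression decision concrete, here is a minimal sketch using only the standard library. The weights, bias, choice of features and the 0.5 threshold are all hypothetical values picked for illustration; they are not fitted to the Lending Club data and are not taken from the original article.

```python
import math

def default_probability(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the features."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical, hand-picked weights for two illustrative features:
# debt-to-income ratio (dti) and annual income in units of 10,000.
weights = [0.08, -0.30]
bias = -1.0

applicant = [27.65, 2.4]  # dti = 27.65, annual income = 24,000
p = default_probability(applicant, weights, bias)
approve = p < 0.5  # hypothetical decision threshold
print(round(p, 3), approve)
```

In a real workflow the weights would be learned from the historical records rather than chosen by hand.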
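import warnings
warnings.filterwarnings("ignore")
# Drop some useless features such as desc and url, and save the remaining features to a new CSV file:
import pandas as pd
loans_2020 = pd.read_csv("./LoanStats3a.csv", skiprows=1)  # the first line of the file is a string, not data, so skip it
half_count = len(loans_2020) / 2  # roughly 40,000 rows / 2 = 19767.5
loans_2020 = loans_2020.dropna(thresh=half_count, axis=1)  # thresh is the minimum number of non-null values a column needs to be kept, so columns that are more than half empty are dropped
loans_2020 = loans_2020.drop(["desc", "url"], axis=1)  # drop the description and URL columns
loans_2020.to_csv("loans_2020.csv", index=False)  # write to loans_2020.csv (overwriting any existing file); index=False leaves out the row index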
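The `thresh` argument is easy to misread, since it counts the non-null values required to keep a column rather than the nulls required to drop it. A small sketch on a made-up frame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with four rows; column names are hypothetical.
toy = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, np.nan],          # 3 non-null values
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],   # 1 non-null value
    "half_full": [1.0, np.nan, 2.0, np.nan],         # 2 non-null values
})

half_count = len(toy) // 2  # 2
# Keep only columns that have at least half_count non-null values.
kept = toy.dropna(thresh=half_count, axis=1)
print(list(kept.columns))
```

A column that is exactly half full still meets the threshold and is kept; only columns with fewer non-null values than `thresh` are removed.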
# Print the column labels and first-row values for a first pass at spotting useless features:
import pandas as pd
loans_2020 = pd.read_csv("loans_2020.csv")
print("First row of the data\n", loans_2020.iloc[0])  # the first row
print("Number of columns in the raw data =", loans_2020.shape[1])  # shape[1] is the number of columns, shape[0] the number of rows
Output:
First row of the data
id                            1077501
member_id                     1.2966e+06
loan_amnt                     5000
funded_amnt                   5000
funded_amnt_inv               4975
term                          36 months
int_rate                      10.65%
installment                   162.87
grade                         B
sub_grade                     B2
emp_title                     NaN
emp_length                    10+ years
home_ownership                RENT
annual_inc                    24000
verification_status           Verified
issue_d                       Dec-11
loan_status                   Fully Paid
pymnt_plan                    n
purpose                       credit_card
title                         Computer
zip_code                      860xx
addr_state                    AZ
dti                           27.65
delinq_2yrs                   0
earliest_cr_line              Jan-85
inq_last_6mths                1
open_acc                      3
pub_rec                       0
revol_bal                     13648
revol_util                    83.70%
total_acc                     9
initial_list_status           f
out_prncp                     0
out_prncp_inv                 0
total_pymnt                   5863.16
total_pymnt_inv               5833.84
total_rec_prncp               5000
total_rec_int                 863.16
total_rec_late_fee            0
recoveries                    0
collection_recovery_fee       0
last_pymnt_d                  Jan-15
last_pymnt_amnt               171.62
last_credit_pull_d            Nov-16
collections_12_mths_ex_med    0
policy_code                   1
application_type              INDIVIDUAL
acc_now_delinq                0
chargeoff_within_12_mths      0
delinq_amnt                   0
pub_rec_bankruptcies          0
tax_liens                     0
Name: 0, dtype: object
Number of columns in the raw data = 52
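Common sense tells us that id and member_id have nothing to do with whether a loan should be granted. funded_amnt and funded_amnt_inv are the amounts actually funded to the borrower after the lending decision has been made, so they cannot be used as predictors either. Feature selection therefore follows what the product manager and the team agreed on; the code that drops the chosen features is as follows.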
2. Drop the useless features:
loans_2020 = loans_2020.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv",
                              "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
loans_2020 = loans_2020.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt",
                              "total_pymnt_inv", "total_rec_prncp"], axis=1)
loans_2020 = loans_2020.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee",
                              "last_pymnt_d", "last_pymnt_amnt"], axis=1)
print("Number of columns now =", loans_2020.shape[1])
Output: Number of columns now = 32
There were 52 columns before.
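Several of the dropped columns (for example total_pymnt and recoveries) appear to describe what happened after the loan was issued, so they would not be available when a new application is scored. One way to keep such a list reusable is to drop only the names that are actually present. The sketch below uses a made-up two-row frame with toy values, and the variable name `post_decision_cols` is hypothetical.

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500],
    "total_pymnt": [5800.0, 1000.0],
    "recoveries": [0.0, 100.0],
    "loan_status": ["Fully Paid", "Charged Off"],
})

# Columns assumed to be known only after the lending decision.
post_decision_cols = ["total_pymnt", "recoveries", "last_pymnt_d"]

# Drop only the columns that exist, so a missing name does not raise KeyError.
present = [c for c in post_decision_cols if c in toy.columns]
features = toy.drop(columns=present)
print(list(features.columns))
```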
3. Determine the current loan status
print(loans_2020["loan_status"].value_counts())  # count how many records have each value of this column
Output:
Fully Paid 33693
Charged Off 5612
Current 201
Late (31-120 days) 10
In Grace Period 9
Late (16-30 days) 5
Default 1
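A quick piece of arithmetic on the counts printed above shows how the records are distributed. The sketch simply re-enters those counts by hand; nothing here reads the actual CSV.

```python
# Counts copied from the value_counts() output above.
counts = {
    "Fully Paid": 33693,
    "Charged Off": 5612,
    "Current": 201,
    "Late (31-120 days)": 10,
    "In Grace Period": 9,
    "Late (16-30 days)": 5,
    "Default": 1,
}

total = sum(counts.values())
settled = counts["Fully Paid"] + counts["Charged Off"]

share_settled = settled / total            # share of loans with a final outcome
share_paid = counts["Fully Paid"] / settled  # share of settled loans fully repaid
print(f"{share_settled:.1%} settled, {share_paid:.1%} of those fully paid")
```

The two settled statuses cover nearly all records, and fully paid loans outnumber charged-off loans by roughly six to one, so the two classes are noticeably imbalanced.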
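Keep only the two final outcomes and turn them into a binary label encoded as 0 and 1:
loans_2020 = loans_2020[(loans_2020["loan_status"] == "Fully Paid") | (loans_2020["loan_status"] == "Charged Off")]
status_replace = {
    "loan_status": {
        "Fully Paid": 1, "Charged Off": 0}}
# The column name is the outer key and its value is another dictionary: "Fully Paid" becomes 1 (repaid in full), "Charged Off" becomes 0 (default)
loans_2020 = loans_2020.replace(status_replace)  # performs a find-and-replace
loans_2020["loan_status"]
Output: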
0 1
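The same filtering and encoding can also be written with `isin` and `Series.map`, which touches only the one column. This is an alternative sketch on a made-up four-row frame, not the article's own code.

```python
import pandas as pd

# Toy frame standing in for the real loan data.
toy = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid"]}
)

# Keep only rows whose status is one of the two final outcomes.
settled = toy[toy["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()

# Encode the label: 1 = repaid in full, 0 = charged off (default).
settled["loan_status"] = settled["loan_status"].map(
    {"Fully Paid": 1, "Charged Off": 0}
)
print(settled["loan_status"].tolist())
```

Because the rows are filtered first, every remaining value has an entry in the mapping and no NaN labels are produced.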