
Python Data Analysis in Practice: Kiva Loan Data


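1. Introduction to the loan datasets

        Import the libraries used

        graph_objs is a Plotly submodule that exposes Plotly's graph objects. After importing the graph objects you need, you define a figure object from the data to display and your custom layout parameters, then pass it to plotly.offline.iplot() to render it.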

    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt  # for plotting
    import seaborn as sns  # for making plots with seaborn
    color = sns.color_palette()  # color palette
    import plotly.offline as py
    py.init_notebook_mode(connected=True)  # enable Plotly rendering inside the notebook
    import plotly.graph_objs as go
    import plotly.offline as offline  # same module as `py`; both aliases are used below
    import plotly.tools as tls
    import squarify
    from mpl_toolkits.basemap import Basemap
    from numpy import array
    from matplotlib import cm
    # Suppress unnecessary warnings so that the presentation looks clean
    import warnings
    warnings.filterwarnings("ignore")
    # Print all rows and columns
    pd.set_option('display.max_columns', None)
    pd.set_option('display.max_rows', None)
    # IPython magic: only valid inside a Jupyter notebook
    %matplotlib inline

The datasets:

    kiva_loans_data = pd.read_csv("kiva_loans.csv")  # loan records
    kiva_mpi_locations_data = pd.read_csv("kiva_mpi_region_locations.csv")  # regions with location info and MPI
    loan_theme_ids_data = pd.read_csv("loan_theme_ids.csv")  # loan themes
    loan_themes_by_region_data = pd.read_csv("loan_themes_by_region.csv")  # loan themes by region
    # loans_data = pd.read_csv("loans.csv")
    lenders_data = pd.read_csv("lenders.csv")  # lender data
    loans_lenders_data = pd.read_csv("loans_lenders.csv")  # mapping between loans and lenders
    country_stats_data = pd.read_csv("country_stats.csv")  # country statistics
    mpi_national_data = pd.read_csv("MPI_national.csv")  # national Multidimensional Poverty Index
    mpi_subnational_data = pd.read_csv("MPI_subnational.csv")  # subnational (regional) Multidimensional Poverty Index

Dataset sizes:

    print("Size of kiva_loans_data", kiva_loans_data.shape)
    print("Size of kiva_mpi_locations_data", kiva_mpi_locations_data.shape)
    print("Size of loan_theme_ids_data", loan_theme_ids_data.shape)
    print("Size of loan_themes_by_region_data", loan_themes_by_region_data.shape)
    print("***** Additional kiva snapshot******")
    # print("Size of loans_data", loans_data.shape)
    print("Size of lenders_data", lenders_data.shape)
    print("Size of loans_lenders_data", loans_lenders_data.shape)
    print("Size of country_stats_data", country_stats_data.shape)
    print("*****Multidimensional Poverty Measures Data set******")
    print("Size of mpi_national_data", mpi_national_data.shape)
    print("Size of mpi_subnational_data", mpi_subnational_data.shape)

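Dataset overview:

Passing include=["O"] (the letter O, not zero; it selects object-dtype columns) to describe() reports the non-numeric columns, which describe() otherwise skips.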
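
As a minimal sketch of that call, the snippet below runs describe() on a tiny made-up frame. The frame `df` is hypothetical and only stands in for kiva_loans_data; the string columns are created with an explicit object dtype so the selection behaves the same across pandas versions.

```python
import pandas as pd

# Small hypothetical frame standing in for kiva_loans_data
df = pd.DataFrame({
    "sector": pd.Series(["Agriculture", "Food", "Food", "Retail"], dtype=object),
    "country": pd.Series(["Kenya", "Philippines", "Kenya", "Kenya"], dtype=object),
    "loan_amount": [300.0, 250.0, 500.0, 125.0],
})

# include=["O"]: only object columns, summarized by count, unique, top, freq
object_summary = df.describe(include=["O"])

# include="all": every column, numeric and non-numeric together
full_summary = df.describe(include="all")

print(object_summary)
```

Use include="all" when one table covering every column is wanted; statistics that do not apply to a column show up as NaN.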

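Check for missing values: count the missing values in each column, sort the counts, and compute the percentage of rows that are missing.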
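
The check described above can be sketched as a small helper. The function name `missing_report` and the sample frame are hypothetical; in the article's setting the helper would be called on kiva_loans_data and the other frames.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Count missing values per column, sorted descending, with percentages."""
    total = df.isnull().sum().sort_values(ascending=False)
    percent = (total / len(df) * 100).round(2)
    return pd.concat([total, percent], axis=1, keys=["Total", "Percent"])


# Hypothetical sample data, for illustration only
sample = pd.DataFrame({
    "tags": [np.nan, np.nan, "#Parent", np.nan],
    "use": ["buy seeds", None, "buy stock", "repair roof"],
    "loan_amount": [300, 250, 500, 125],
})

report = missing_report(sample)
print(report)
```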

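As can be seen, apart from tags, the other columns have few missing values.

The region and poverty-index data has many missing values.

The loan dataset has few missing values.

In the data on loan themes by region, geocode_old and mpi_geo have many missing values; the other columns have few.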

2. Data visualization

1. Main loan sectors

    plt.figure(figsize=(15, 8))
    sector_name = kiva_loans_data['sector'].value_counts()
    # Pass x and y by keyword: recent seaborn releases no longer accept them positionally
    sns.barplot(x=sector_name.values, y=sector_name.index)
    # Annotate each bar with its loan count
    for i, v in enumerate(sector_name.values):
        plt.text(0.8, i, v, color='k', fontsize=19)
    plt.xticks(rotation='vertical')
    plt.xlabel('Number of loans given')
    plt.ylabel('Sector Name')
    plt.title("Top sectors in which more loans were given")
    plt.show()

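        The top sectors are agriculture, food, retail, services, housing, clothing, and education, all essentials of daily life.

        A more intuitive view (treemap):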

    plt.figure(figsize=(15, 8))
    count = kiva_loans_data['sector'].value_counts()
    squarify.plot(sizes=count.values, label=count.index, value=count.values)
    plt.title('Distribution of sectors')

Detailed uses:

    plt.figure(figsize=(15, 8))
    count = kiva_loans_data['use'].value_counts().head(10)
    # Pass x and y by keyword: recent seaborn releases no longer accept them positionally
    sns.barplot(x=count.values, y=count.index)
    # Annotate each bar with its count
    for i, v in enumerate(count.values):
        plt.text(0.8, i, v, color='k', fontsize=19)
    plt.xlabel('Count', fontsize=12)
    plt.ylabel('uses of loans', fontsize=12)
    plt.title("Most popular uses of loans", fontsize=16)

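Water, food, and medicine are the most common uses.

2. Repayment

        Irregular repayment (paying whenever money is available) and monthly repayment are the most common.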
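
The article's code for this step is not shown, so the following is only a sketch of one way to get such a ranking: count each value of the repayment_interval column and express it as a share. The sample values are made up.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data['repayment_interval']
repayment = pd.Series(
    ["irregular", "monthly", "irregular", "bullet",
     "monthly", "irregular", "weekly", "monthly"],
    name="repayment_interval",
)

counts = repayment.value_counts()
shares = (repayment.value_counts(normalize=True) * 100).round(1)

# One table with the absolute count and the percentage per interval
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```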

4. Which countries borrow the most

        Lower-income countries such as the Philippines and Kenya have the highest demand for loans.
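
A ranking like this can be produced by grouping on the country column. The sketch below uses made-up rows; it reports both the number of loans and the total amount per country, which is an addition for illustration and not necessarily what the original plot showed.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data
loans = pd.DataFrame({
    "country": ["Philippines", "Philippines", "Philippines", "Kenya", "Kenya", "Peru"],
    "loan_amount": [100, 200, 300, 400, 500, 1000],
})

# Number of loans and total amount per country, most loans first
by_country = (
    loans.groupby("country")["loan_amount"]
    .agg(["count", "sum"])
    .sort_values("count", ascending=False)
)

top = by_country.head(2)
print(top)
```

Ranking by count and by sum can give different orders, so it is worth stating which one a chart uses.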

5. Loan amounts

    plt.figure(figsize=(12, 8))
    # Sort the funded amounts so the curve shows their distribution
    plt.scatter(range(kiva_loans_data.shape[0]), np.sort(kiva_loans_data.funded_amount.values))
    plt.xlabel('index', fontsize=12)
    plt.ylabel('funded_amount', fontsize=12)  # the plotted column is funded_amount
    plt.title("Loan Amount Distribution")
    plt.show()

The vast majority of loans are small, below 20,000; only a handful of points have much higher amounts.
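
A reading like this from a scatter plot can be backed with numbers. The sketch below, on made-up amounts, computes the share of loans under a threshold and a few quantiles; the 20,000 threshold is taken from the sentence above.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data['funded_amount']
funded = pd.Series(
    [25, 100, 250, 300, 500, 750, 1000, 2500, 5000, 50000],
    name="funded_amount",
)

# Percentage of loans below the threshold
share_small = (funded < 20000).mean() * 100

# Median and upper quantiles show how skewed the distribution is
quantiles = funded.quantile([0.5, 0.9, 0.99])

print(f"{share_small:.1f}% of loans are under 20,000")
print(quantiles)
```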

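6. Demand by world region

Demand is relatively high in Sub-Saharan Africa, while Europe and Central Asia show almost none.

7. Distribution of the number of lenders

        Loans funded by a single lender, or by 5 to 10 lenders, are the most common.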
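
One way to check a statement like this is to bin the lender counts. The sketch below assumes the loan table has a `lender_count` column (not shown in this article) and uses bin edges chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; `lender_count` is an assumed column name
lender_count = pd.Series([1, 1, 1, 3, 5, 6, 8, 10, 12, 40], name="lender_count")

# Right-closed bins: (0, 1], (1, 4], (4, 10], (10, inf)
bins = [0, 1, 4, 10, np.inf]
labels = ["1", "2-4", "5-10", "11+"]

groups = pd.cut(lender_count, bins=bins, labels=labels)

# Keep the bins in their natural order instead of sorting by frequency
distribution = groups.value_counts().reindex(labels)
print(distribution)
```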

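8. Detailed loan purposes

General stores and farming account for the most loans.

9. Repayment term

 

 Terms of 8 months and 14 months are the most common.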

9. Gender ratio

    # A loan can list several borrowers, e.g. "female, female, male"
    gender_list = []
    for gender in kiva_loans_data["borrower_genders"].values:
        if str(gender) != "nan":
            gender_list.extend([lst.strip() for lst in gender.split(",")])
    temp_data = pd.Series(gender_list).value_counts()
    labels = np.array(temp_data.index)
    sizes = np.array((temp_data / temp_data.sum()) * 100)  # percentages
    trace = go.Pie(labels=labels, values=sizes)
    layout = go.Layout(title='Borrower Gender')
    data = [trace]
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename="BorrowerGender")

Most borrowers are women.
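
The same gender split can also be computed without an explicit loop. This is an alternative sketch using pandas string methods and explode(), run on made-up values rather than the real borrower_genders column.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data['borrower_genders']
borrower_genders = pd.Series(
    ["female", "female, female, male", None, "male", "female, female"],
    name="borrower_genders",
)

gender_share = (
    borrower_genders.dropna()   # skip loans with no gender recorded
    .str.split(",")             # "female, male" -> ["female", " male"]
    .explode()                  # one row per borrower
    .str.strip()                # remove the space left after each comma
    .value_counts(normalize=True)
    * 100
)
print(gender_share.round(2))
```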

11. Average amount

    # Keep only the first listed borrower's gender for each loan; missing values stay NaN
    first_gender = kiva_loans_data['borrower_genders'].str.split(',').str[0].str.strip()
    kiva_loans_data['sex_borrowers'] = first_gender
    # Mean funded amount per gender, highest first
    sex_mean = (kiva_loans_data.groupby('sex_borrowers')['funded_amount']
                .mean().sort_values(ascending=False).reset_index())
    print(sex_mean)
    g1 = sns.barplot(x='sex_borrowers', y='funded_amount', data=sex_mean)
    g1.set_title("Mean funded Amount by Gender ", fontsize=15)
    g1.set_xlabel("Gender")
    g1.set_ylabel("Average funded Amount(US)", fontsize=12)

Male borrowers have the higher average funded amount.

    f, ax = plt.subplots(figsize=(15, 5))
    # Print the gender counts for each repayment interval
    for interval in ['monthly', 'weekly', 'bullet', 'irregular']:
        mask = kiva_loans_data['repayment_interval'] == interval
        print("Genders count with repayment interval " + interval + "\n",
              kiva_loans_data.loc[mask, 'sex_borrowers'].value_counts())
    sns.countplot(x="sex_borrowers", hue='repayment_interval', data=kiva_loans_data).set_title('sex borrowers with repayment_intervals')

 Male borrowers are most prominent in lump-sum (bullet) repayment.
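
Raw counts are hard to compare when one gender has far more loans than the other. As a sketch, a row-normalized cross-tabulation gives the share of each repayment interval within each gender. The rows below are made up and say nothing about the real data.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data
loans = pd.DataFrame({
    "sex_borrowers": ["female", "female", "female", "female",
                      "male", "male", "male", "male"],
    "repayment_interval": ["monthly", "monthly", "irregular", "irregular",
                           "bullet", "bullet", "monthly", "bullet"],
})

# normalize="index" makes every row sum to 1; multiply to get percentages
table = pd.crosstab(
    loans["sex_borrowers"], loans["repayment_interval"], normalize="index"
) * 100
print(table)
```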

12. Loans by country

    # Select the column before taking the mean, so non-numeric columns are not averaged
    countries_funded_amount = kiva_loans_data.groupby('country')['funded_amount'].mean().sort_values(ascending=False)
    print("Top Countries with funded_amount(Dollar value of loan funded on Kiva.org)(Mean values)\n", countries_funded_amount.head(10))

    data = [dict(
        type='choropleth',
        locations=countries_funded_amount.index,
        locationmode='country names',
        z=countries_funded_amount.values,
        text=countries_funded_amount.index,
        colorscale='Reds',  # 'Red' is not a named Plotly colorscale
        marker=dict(line=dict(width=0.7)),
        # the obsolete autotick option is dropped
        colorbar=dict(tickprefix='', title='Top Countries with funded_amount(Mean value)'),
    )]
    layout = dict(
        title='Top Countries with funded_amount(Dollar value of loan funded on Kiva.org)',
        geo=dict(
            showframe=False,
            # showcoastlines=False,
            projection=dict(type='mercator'),  # 'Mercatorodes' is not a valid projection type
        ),
    )
    fig = dict(data=data, layout=layout)
    py.iplot(fig, validate=False)

13. Average loans under different conditions

 

 

14. Which countries stand out in the dataset

    from wordcloud import WordCloud
    names = kiva_loans_data["country"][~pd.isnull(kiva_loans_data["country"])]
    # print(names)
    # Join all country names into one text; more frequent names are drawn larger
    wordcloud = WordCloud(max_font_size=50, width=600, height=300).generate(' '.join(names))
    plt.figure(figsize=(15, 8))
    plt.imshow(wordcloud)
    plt.title("Wordcloud for country Names", fontsize=35)
    plt.axis("off")
    plt.show()

15. Repayment intervals over time

    kiva_loans_data['date'] = pd.to_datetime(kiva_loans_data['date'])
    kiva_loans_data['date_month_year'] = kiva_loans_data['date'].dt.to_period("M")
    plt.figure(figsize=(8, 10))
    # Mean loan amount per month, one line per repayment interval
    g1 = sns.pointplot(x='date_month_year', y='loan_amount',
                       data=kiva_loans_data, hue='repayment_interval')
    g1.set_xticklabels(g1.get_xticklabels(), rotation=90)
    g1.set_title("Mean Loan by Month Year", fontsize=15)
    g1.set_xlabel("")
    g1.set_ylabel("Loan Amount", fontsize=12)
    plt.show()

Lump-sum (bullet) repayment comes out relatively high.

14. Loans by country over time

    # Despite its name, the 'Century' column holds the calendar year
    kiva_loans_data['Century'] = kiva_loans_data.date.dt.year
    # Mean loan amount per country (rows) and year (columns)
    loan = kiva_loans_data.groupby(['country', 'Century'])['loan_amount'].mean().unstack()
    loan = loan.sort_values([2017], ascending=False)
    f, ax = plt.subplots(figsize=(15, 20))
    loan = loan.fillna(0)
    temp = sns.heatmap(loan, cmap='Reds')
    plt.show()

 In some countries borrowing differs a lot from year to year, possibly because of war, natural disasters, or similar factors.
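
To find such countries without reading them off a heatmap, one option is to rank countries by the gap between their highest and lowest yearly mean. This is a sketch on made-up rows, and the `year` column name is chosen for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data
loans = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Kenya", "Peru", "Peru"],
    "year": [2016, 2016, 2017, 2016, 2017],
    "loan_amount": [400.0, 600.0, 450.0, 1000.0, 3000.0],
})

# Mean loan amount per country (rows) and year (columns)
by_year = loans.groupby(["country", "year"])["loan_amount"].mean().unstack()

# Gap between the highest and the lowest yearly mean, largest first
spread = (by_year.max(axis=1) - by_year.min(axis=1)).sort_values(ascending=False)
print(spread)
```

A large gap only flags a country for a closer look; it does not explain the cause.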

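15. Repayment intervals by sector

    sector_repayment = ['sector', 'repayment_interval']
    # Named red_cmap so it does not shadow matplotlib's `cm` imported earlier
    red_cmap = sns.light_palette("red", as_cmap=True)
    # Cross-tabulation (contingency table) of sector against repayment interval
    pd.crosstab(kiva_loans_data[sector_repayment[0]], kiva_loans_data[sector_repayment[1]]).style.background_gradient(cmap=red_cmap)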

 16. Requested loan amount versus funded amount

    kiva_loans_data.index = pd.to_datetime(kiva_loans_data['posted_time'])
    plt.figure(figsize=(12, 8))
    # Weekly totals; both series are drawn on the same axes
    ax = kiva_loans_data['loan_amount'].resample('W').sum().plot()
    ax = kiva_loans_data['funded_amount'].resample('W').sum().plot()
    ax.set_ylabel('Amount ($)')
    ax.set_xlabel('month-year')
    ax.set_xlim((kiva_loans_data.index.min(), kiva_loans_data.index.max()))
    ax.legend(["loan amount", "funded amount"])
    plt.title('Trend of loan amount vs. funded amount')
    plt.show()
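
The gap between the two curves can also be expressed as a ratio. The sketch below resamples made-up loans by week and divides the funded total by the requested total; the `funded_ratio` column is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for kiva_loans_data, indexed by posting time
loans = pd.DataFrame({
    "posted_time": pd.to_datetime(
        ["2017-01-02", "2017-01-03", "2017-01-09", "2017-01-10"]
    ),
    "loan_amount": [100.0, 300.0, 200.0, 200.0],
    "funded_amount": [100.0, 200.0, 200.0, 200.0],
}).set_index("posted_time")

# Weekly totals ("W" bins end on Sunday)
weekly = loans[["loan_amount", "funded_amount"]].resample("W").sum()

# Share of the requested amount that was actually funded each week
weekly["funded_ratio"] = weekly["funded_amount"] / weekly["loan_amount"]
print(weekly)
```

A ratio of 1.0 means every requested dollar in that week was funded; lower values show weeks where funding fell short.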

17. A closer look at one country

    loan_use_in_india = kiva_loans_data['use'][kiva_loans_data['country'] == 'India']
    # Share of each use among Indian loans, in percent; keep the 13 most common
    percentages = round(loan_use_in_india.value_counts() / len(loan_use_in_india) * 100, 2)[:13]
    trace = go.Pie(labels=percentages.index, values=percentages.values, hoverinfo='label+percent',
                   textfont=dict(size=18, color='#000000'))
    data = [trace]
    # title=dict(text=..., font=...) replaces the deprecated titlefont argument
    layout = go.Layout(width=800, height=800,
                       title=dict(text='Top 13 loan uses in India', font=dict(size=20)),
                       legend=dict(x=0.1, y=-0.7))
    fig = go.Figure(data=data, layout=layout)
    offline.iplot(fig, show_link=False)

 

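 In India the most common loan use is to buy a smokeless stove, followed by expanding a tailoring business by purchasing fabric and a sewing machine.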

18. The 7 regions with the most loans

    # Plot the top 7 funded regions on a map of India; circles are sized by funded amount.
    # top7_cities is expected to be a DataFrame with 'lon', 'lat', 'amount' and 'region'
    # columns; the code that builds it is not shown in this article.
    plt.subplots(figsize=(20, 15))
    # Named m rather than map, to avoid shadowing the built-in
    m = Basemap(width=4500000, height=900000, projection='lcc', resolution='l',
                llcrnrlon=67, llcrnrlat=5, urcrnrlon=99, urcrnrlat=37, lat_0=28, lon_0=77)
    m.drawmapboundary()
    m.drawcountries()
    m.drawcoastlines()
    lg = array(top7_cities['lon'])
    lt = array(top7_cities['lat'])
    nc = array(top7_cities['region'])
    x, y = m(lg, lt)  # project longitude/latitude to map coordinates
    population_sizes = top7_cities["amount"].apply(lambda x: int(x / 3000))
    plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, alpha=0.9)
    # Label each circle with its region name
    for ncs, xpt, ypt in zip(nc, x, y):
        plt.text(xpt + 60000, ypt + 30000, ncs, fontsize=20, fontweight='bold')
    plt.title('Top 7 funded regions in India', fontsize=30)

19. Poverty index

    data = [dict(
        type='scattergeo',
        lat=kiva_mpi_locations_data['lat'],
        lon=kiva_mpi_locations_data['lon'],
        text=kiva_mpi_locations_data['LocationName'],
        marker=dict(
            size=10,
            line=dict(
                width=1,
                color='rgb(102, 102, 102)'  # rgba() needs a fourth (alpha) value, so rgb() is used
            ),
            # Color each point by its MPI value
            cmin=0,
            color=kiva_mpi_locations_data['MPI'],
            cmax=kiva_mpi_locations_data['MPI'].max(),
            colorbar=dict(
                title="Multidimensional Poverty Index"
            )
        ))]
    layout = dict(title='Multidimensional Poverty Index for different regions')
    fig = dict(data=data, layout=layout)
    py.iplot(fig)

19. Human Development Index

    data = [dict(
        type='choropleth',
        locations=country_stats_data['country_name'],
        locationmode='country names',
        z=country_stats_data['hdi'],
        text=country_stats_data['country_name'],
        colorscale='Reds',  # 'Red' is not a named Plotly colorscale
        marker=dict(line=dict(width=0.7)),
        # the obsolete autotick option is dropped
        colorbar=dict(tickprefix='', title='Human Development Index(HDI)'),
    )]
    layout = dict(title='Human Development Index(HDI) for different countries')
    fig = dict(data=data, layout=layout)
    py.iplot(fig, validate=False)

20. Poverty compared across countries

 

    data = [dict(
        type='choropleth',
        locations=country_stats_data['country_name'],
        locationmode='country names',
        z=country_stats_data['population_below_poverty_line'],
        text=country_stats_data['country_name'],
        colorscale='Reds',  # 'Red' is not a named Plotly colorscale
        marker=dict(line=dict(width=0.7)),
        # the obsolete autotick option is dropped
        colorbar=dict(tickprefix='', title='population_below_poverty_line in %'),
    )]
    layout = dict(title='Population below poverty line for different countries in %')
    fig = dict(data=data, layout=layout)
    py.iplot(fig, validate=False)
