This dataset was collected by the GroupLens lab from MovieLens and contains movie ratings from the late 1990s and early 2000s. It includes the ratings themselves, movie metadata (genre and year), and viewer information (age, zip code, gender, occupation).
First, we can read the data into DataFrames using pandas.read_csv (see the tutorial on reading files):
import pandas as pd

unames = ["user_id", "gender", "age", "occupation", "zip"]
users = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/users.dat",
                    sep="::", header=None, names=unames)

rnames = ["user_id", "movie_id", "rating", "timestamp"]
ratings = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/ratings.dat",
                      sep="::", header=None, names=rnames)

mnames = ["movie_id", "title", "genres"]
movies = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/movies.dat",
                     sep="::", header=None, names=mnames)

print(users.head())
#    user_id gender  age  occupation    zip
# 0        1      F    1          10  48067
# 1        2      M   56          16  70072
# 2        3      M   25          15  55117
# 3        4      M   45           7  02460
# 4        5      M   25          20  55455

print(ratings.head())
#    user_id  movie_id  rating  timestamp
# 0        1      1193       5  978300760
# 1        1       661       3  978302109
# 2        1       914       3  978301968
# 3        1      3408       4  978300275
# 4        1      2355       5  978824291

print(movies.head())
#    movie_id                               title                        genres
# 0         1                    Toy Story (1995)   Animation|Children's|Comedy
# 1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
# 2         3             Grumpier Old Men (1995)                Comedy|Romance
# 3         4            Waiting to Exhale (1995)                  Comedy|Drama
# 4         5  Father of the Bride Part II (1995)                        Comedy
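Note that a multi-character separator such as "::" cannot be handled by pandas' default C parser, so pandas falls back to the slower Python engine and emits a ParserWarning. A minimal sketch of making that fallback explicit (same file path and column names as above) is to pass engine="python":

# Requesting the Python engine explicitly silences the ParserWarning
users = pd.read_csv("../pydata-book-2nd-edition/datasets/movielens/users.dat",
                    sep="::", header=None, names=unames, engine="python")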
To make later processing easier, we need to merge the three DataFrames into one (see the tutorial on merging DataFrames). Since users and ratings both contain a user_id column, we can use it as the merge key; the merged result and movies both contain a movie_id column, which we then use as the key for the second merge:
data = pd.merge(pd.merge(ratings, users), movies)
print(data.head())
#    user_id  movie_id  ...                                    title genres
# 0        1      1193  ...  One Flew Over the Cuckoo's Nest (1975)  Drama
# 1        2      1193  ...  One Flew Over the Cuckoo's Nest (1975)  Drama
# 2       12      1193  ...  One Flew Over the Cuckoo's Nest (1975)  Drama
# 3       15      1193  ...  One Flew Over the Cuckoo's Nest (1975)  Drama
# 4       17      1193  ...  One Flew Over the Cuckoo's Nest (1975)  Drama

print(data.iloc[0])
# user_id                                            1
# movie_id                                         1193
# rating                                               5
# timestamp                                     978300760
# gender                                                F
# age                                                   1
# occupation                                           10
# zip                                               48067
# title         One Flew Over the Cuckoo's Nest (1975)
# genres                                            Drama
# Name: 0, dtype: object
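Here pd.merge infers the join keys from the overlapping column names. A minimal sketch that spells the keys out explicitly (often safer when the frames share more columns than intended), using the same ratings/users/movies frames as above:

# Equivalent merge with the join keys made explicit
data = pd.merge(pd.merge(ratings, users, on="user_id"), movies, on="movie_id")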
We can then use data aggregation to compute each movie's average rating from male and from female viewers separately (see the aggregation tutorial):
mean_data = data.groupby(["title", "gender"])["rating"].mean().unstack()
print(mean_data.head())
# gender F M
# title
# $1,000,000 Duck (1971) 3.375000 2.761905
# 'Night Mother (1986) 3.388889 3.352941
# 'Til There Was You (1997) 2.675676 2.733333
# 'burbs, The (1989) 2.793478 2.962085
# ...And Justice for All (1979) 3.828571 3.689024
When analyzing data, we usually drop items with too few samples. Here we use aggregation to remove movies that received fewer than 250 ratings:
ratings_by_title = data.groupby("title").size()  # size of each group, i.e. number of ratings per title
print(ratings_by_title.head())
# title
# $1,000,000 Duck (1971)            37
# 'Night Mother (1986)              70
# 'Til There Was You (1997)         52
# 'burbs, The (1989)               303
# ...And Justice for All (1979)    199
# dtype: int64

active_titles = ratings_by_title.index[ratings_by_title >= 250]  # titles with at least 250 ratings
mean_data = mean_data.loc[active_titles]  # select those rows to get the filtered table
print(mean_data.head())
# gender                                    F         M
# title
# 'burbs, The (1989)                 2.793478  2.962085
# 10 Things I Hate About You (1999)  3.646552  3.311966
# 101 Dalmatians (1961)              3.791444  3.500000
# 101 Dalmatians (1996)              3.240000  2.911215
# 12 Angry Men (1957)                4.184397  4.328421
Then we can use the sort_values method to get the 10 movies rated highest by female viewers:
top_female_ratings = mean_data.sort_values(by="F",ascending=False)
print(top_female_ratings.head(10))
# gender F M
# title
# Close Shave, A (1995) 4.644444 4.473795
# Wrong Trousers, The (1993) 4.588235 4.478261
# Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
# Wallace & Gromit: The Best of Aardman Animation... 4.563107 4.385075
# Schindler's List (1993) 4.562602 4.491415
# Shawshank Redemption, The (1994) 4.539075 4.560625
# Grand Day Out, A (1992) 4.537879 4.293255
# To Kill a Mockingbird (1962) 4.536667 4.372611
# Creature Comforts (1990) 4.513889 4.272277
# Usual Suspects, The (1995) 4.513317 4.518248
We can add a new column holding the difference between the male and female mean ratings (M minus F) to quantify the disagreement:
mean_data['diff'] = mean_data["M"] - mean_data["F"]
print(mean_data.sort_values(by="diff").head())
# gender F M diff
# title
# Dirty Dancing (1987)       3.790378  2.959596 -0.830782
# Jumpin' Jack Flash (1986)  3.254717  2.578358 -0.676359
# Grease (1978)              3.975265  3.367041 -0.608224
# Little Women (1994)        3.870588  3.321739 -0.548849
# Steel Magnolias (1989)     3.901734  3.365957 -0.535777
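Sorting the same diff column in descending order surfaces the other end of the spectrum, the movies men rated higher than women; a minimal sketch (output omitted):

# Largest positive differences: movies preferred by male viewers
print(mean_data.sort_values(by="diff", ascending=False).head())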
Suppose we want the most divisive movies regardless of gender, i.e. the ones whose ratings disagree the most. We can measure this with the standard deviation or variance of the ratings:
rating_std_by_title = data.groupby("title")["rating"].std()
rating_std_by_title = rating_std_by_title.loc[active_titles]
print(rating_std_by_title.sort_values(ascending=False).head(10))
# title
# Dumb & Dumber (1994) 1.321333
# Blair Witch Project, The (1999) 1.316368
# Natural Born Killers (1994) 1.307198
# Tank Girl (1995) 1.277695
# Rocky Horror Picture Show, The (1975) 1.260177
# Eyes Wide Shut (1999) 1.259624
# Evita (1996) 1.253631
# Billy Madison (1995) 1.249970
# Fear and Loathing in Las Vegas (1998) 1.246408
# Bicentennial Man (1999) 1.245533
# Name: rating, dtype: float64
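Since either the standard deviation or the variance can serve as the disagreement measure, the same ranking can be produced with .var(); the ordering is identical because the variance is just the squared standard deviation:

# Same idea using the variance instead of the standard deviation
rating_var_by_title = data.groupby("title")["rating"].var().loc[active_titles]
print(rating_var_by_title.sort_values(ascending=False).head(10))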
The attached babynames folder contains the name, sex, and count of babies born between 1880 and 2010, limited to names that occurred at least 5 times; each year is stored in its own txt file.
We start by reading the 1880 file as an example:
names1880 = pd.read_csv("../pydata-book-2nd-edition/datasets/babynames/yob1880.txt",
names=["name","sex","number"])
print(names1880.head())
# name sex number
# 0 Mary F 7065
# 1 Anna F 2604
# 2 Emma F 2003
# 3 Elizabeth F 1939
# 4 Minnie F 1746
We can check the number of births by sex:
print(names1880.groupby("sex").number.sum())
# F 90993
# M 110493
# Name: number, dtype: int64
Since the data is spread across many files (yob1880.txt through yob2010.txt), the first step is to gather it into a single DataFrame and tag each row with its year:
years = range(1880, 2011)  # include 2010
pieces = []
columns = ["name", "sex", "number"]

for year in years:
    path = "../pydata-book-2nd-edition/datasets/babynames/yob%d.txt" % year
    frame = pd.read_csv(path, names=columns)
    frame["year"] = year
    pieces.append(frame)

# Concatenate all frames row-wise (the default) and rebuild the row index
names = pd.concat(pieces, ignore_index=True)
print(names.head())
#         name sex  number  year
# 0       Mary   F    7065  1880
# 1       Anna   F    2604  1880
# 2       Emma   F    2003  1880
# 3  Elizabeth   F    1939  1880
# 4     Minnie   F    1746  1880
After that we can aggregate the data, for example counting male and female births per year, and then visualize the result:
import matplotlib.pyplot as plt

total_births = names.groupby(["year", "sex"])["number"].sum().unstack()
print(total_births.head(10))
# sex        F       M
# year
# 1880   90993  110493
# 1881   91955  100748
# 1882  107851  113687
# 1883  112322  104632
# 1884  129021  114445
# 1885  133056  107802
# 1886  144538  110785
# 1887  145983  101412
# 1888  178631  120857
# 1889  178369  110590

total_births.plot(title="Total births by sex and year")
plt.show()
The resulting plot (total births by sex and year) looks like this:
Now we insert a prop column giving each name's share of all same-sex births in that year:
def add_prop(group):
group["prop"] = group.number / group.number.sum()
return group
names = names.groupby(["year","sex"]).apply(add_prop)
print(names.head())
#         name sex  number  year      prop
# 0 Mary F 7065 1880 0.077643
# 1 Anna F 2604 1880 0.028618
# 2 Emma F 2003 1880 0.022013
# 3 Elizabeth F 1939 1880 0.021309
# 4 Minnie F 1746 1880 0.019188
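As a sanity check, the prop values should sum to (approximately) 1 within every year/sex group; a quick verification, assuming numpy is available:

import numpy as np

# Each year/sex group's proportions should add up to 1
print(np.allclose(names.groupby(["year", "sex"]).prop.sum(), 1))
# True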
To make later analysis easier, we keep only the top 1000 names (by count) for each year/sex combination:
def get_top1000(group):
return group.sort_values(by="number",ascending=False)[:1000]
grouped = names.groupby(["year","sex"])
top1000 = grouped.apply(get_top1000)
top1000.reset_index(inplace=True,drop=True)
# drop the original hierarchical index and replace it with the default integer index
print(top1000.head())
#         name sex  number  year      prop
# 0 Mary F 7065 1880 0.077643
# 1 Anna F 2604 1880 0.028618
# 2 Emma F 2003 1880 0.022013
# 3 Elizabeth F 1939 1880 0.021309
# 4 Minnie F 1746 1880 0.019188
If we want to split the data into boys and girls, we can do the following:
boys = top1000[top1000["sex"] == "M"]
girls = top1000[top1000["sex"] == "F"]
We can also look at how the count of each name changes over time:
total_births = top1000.pivot_table("number",index="year",columns="name",aggfunc=sum)
# equivalent to top1000.groupby(["year","name"])["number"].sum().unstack()
print(total_births.head())
# name Aaden Aaliyah Aarav Aaron Ab ... Zola Zollie Zona Zora Zula
# year ...
# 1880 NaN NaN NaN 102.0 NaN ... 7.0 NaN 8.0 28.0 27.0
# 1881 NaN NaN NaN 94.0 NaN ... 10.0 NaN 9.0 21.0 27.0
# 1882 NaN NaN NaN 85.0 NaN ... 9.0 NaN 17.0 32.0 21.0
# 1883 NaN NaN NaN 105.0 NaN ... 10.0 NaN 11.0 35.0 25.0
# 1884 NaN NaN NaN 97.0 NaN ... 14.0 6.0 8.0 58.0 27.0
We can then pick a few names and examine their trends:
subset = total_births[["John","Harry","Mary","Marilyn"]]
subset.plot(subplots = True,figsize = (12,10),grid = False,
title = "Number of births per year")
plt.show()
The trends look like this:
We can see that all four names became increasingly rare over time; we will look into why further below.
An interesting pattern appears when we aggregate the share of total births covered by the top 1000 names:
import numpy as np

table = top1000.pivot_table("prop", index="year", columns="sex", aggfunc=sum)
# sum prop by year and sex
table.plot(title="Sum of table1000.prop by year and sex",
           yticks=np.linspace(0, 1.2, 13), xticks=range(1880, 2020, 10))
plt.show()
As time goes on, the share of births covered by the top 1000 names steadily declines, which indicates that names are becoming more diverse.
Another question: how many distinct names account for the top 50% of births? We can sort prop in descending order with sort_values, compute the running total with cumsum, and then use searchsorted to find the position where the cumulative proportion reaches 0.5; that position gives the number of names (strictly we should add 1, since indexing starts at 0). Take boys born in 2010 as an example:
boys = top1000[top1000.sex == "M"]
df = boys[boys.year == 2010]
print(df.head())
#            name sex  number  year      prop
# 260877    Jacob   M   21875  2010  0.011523
# 260878    Ethan   M   17866  2010  0.009411
# 260879  Michael   M   17133  2010  0.009025
# 260880   Jayden   M   17030  2010  0.008971
# 260881  William   M   16870  2010  0.008887

prop_cumsum = df.sort_values(by="prop", ascending=False).prop.cumsum()
print(prop_cumsum.head())
# 260877    0.011523
# 260878    0.020934
# 260879    0.029959
# 260880    0.038930
# 260881    0.047817
# Name: prop, dtype: float64

print(prop_cumsum.values.searchsorted(0.5))
# 116
Naturally, we can run this aggregation for every year with groupby and then visualize the trend:
def get_quantile_count(group, q=0.5):
    group = group.sort_values(by="prop", ascending=False)
    return group.prop.cumsum().values.searchsorted(q) + 1

diversity = top1000.groupby(["year", "sex"]).apply(get_quantile_count)
diversity = diversity.unstack("sex")
print(diversity.head())
# sex    F   M
# year
# 1880  38  14
# 1881  38  14
# 1882  38  15
# 1883  39  15
# 1884  39  16

diversity.plot(title="Number of popular names in top 50%")
plt.show()
We can see that the number of popular names keeps growing, and that girls' names are more diverse than boys'.
Research has pointed out that the distribution of the last letters of boys' names changed substantially over the past hundred years. To confirm this, we re-aggregate the data by year, sex, and last letter:
get_last_letter = lambda x: x[-1]  # helper to extract the last letter of a name
last_letters = names.name.map(get_last_letter)
last_letters.name = "last_letter"

table = names.pivot_table("number", index=last_letters,
                          columns=["sex", "year"], aggfunc=sum)
# pick three representative years
subtable = table.reindex(columns=[1910, 1960, 2010], level="year")
print(subtable.head())
# sex                 F                            M
# year             1910      1960      2010     1910      1960      2010
# last_letter
# a            108376.0  691247.0  670605.0    977.0    5204.0   28438.0
# b                 NaN     694.0     450.0    411.0    3912.0   38859.0
# c                 5.0      49.0     946.0    482.0   15476.0   23125.0
# d              6750.0    3729.0    2607.0  22111.0  262112.0   44398.0
# e            133569.0  435013.0  313833.0  28655.0  178823.0  129012.0
Next we normalize the table so that each count becomes a proportion of total births for that year and sex, and then draw bar charts:
letter_prop = subtable / subtable.sum()
print(letter_prop.head())
# sex                 F                             M
# year             1910      1960      2010      1910      1960      2010
# last_letter
# a            0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
# b                 NaN  0.000343  0.000256  0.002116  0.001834  0.020470
# c            0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
# d            0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
# e            0.336941  0.215133  0.178415  0.147556  0.083853  0.067959

fig, axes = plt.subplots(2, 1, figsize=(10, 8))
letter_prop["M"].plot(kind="bar", rot=0, ax=axes[0], title="Male")
letter_prop["F"].plot(kind="bar", rot=0, ax=axes[1], title="Female")
plt.show()
We can see that the proportion of boys' names ending in n has grown markedly since the 1960s. Using the full table, we can trace how letters such as d, n, and y evolved over the whole period:
letter_prop = table/table.sum()
dny_ts = letter_prop.loc[["d","n","y"],"M"].T
# select rows d, n and y from the M columns, then transpose
print(dny_ts.head())
# last_letter d n y
# year
# 1880 0.083055 0.153213 0.075760
# 1881 0.083247 0.153214 0.077451
# 1882 0.085340 0.149560 0.077537
# 1883 0.084066 0.151646 0.079144
# 1884 0.086120 0.149915 0.080405
dny_ts.plot()
plt.show()
Another interesting trend is that some names that were popular among boys early on gradually became girls' names over time, such as Lesley or Leslie. To see this directly, we first extract the names containing "lesl":
all_names = pd.Series(top1000.name.unique())
lesley_like = all_names[all_names.str.lower().str.contains("lesl")]
print(lesley_like)
# 632 Leslie
# 2294 Lesley
# 4262 Leslee
# 4728 Lesli
# 6103 Lesly
# dtype: object
Next we filter out all other names and look at how these Lesley-like names are split between the sexes each year:
# keep only the rows whose name is one of the Lesley-like names found above
filtered = top1000[top1000.name.isin(lesley_like)]

table = filtered.pivot_table("number", index="year",
                             columns="sex", aggfunc=sum)
print(table.head())
# sex      F      M
# year
# 1880   8.0   79.0
# 1881  11.0   92.0
# 1882   9.0  128.0
# 1883   7.0  125.0
# 1884  15.0  125.0

table = table.div(table.sum(1), axis=0)  # divide each element by its row sum
print(table.head())
# sex          F         M
# year
# 1880  0.091954  0.908046
# 1881  0.106796  0.893204
# 1882  0.065693  0.934307
# 1883  0.053030  0.946970
# 1884  0.107143  0.892857

table.plot(style={"M": "k-", "F": "k--"})  # plot the trend over time
plt.show()
The result looks like this:
The attached file P00000001-ALL.csv covers the 2012 US presidential election and contains contributor names, occupations, employers, addresses, and contribution amounts. We can load it with pandas' read_csv function:
fec = pd.read_csv("../pydata-book-2nd-edition/datasets/fec/P00000001-ALL.csv")
print(fec.info())
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 1001731 entries, 0 to 1001730
# Data columns (total 16 columns):
# cmte_id              1001731 non-null object
# cand_id              1001731 non-null object
# cand_nm              1001731 non-null object
# contbr_nm            1001731 non-null object
# contbr_city          1001712 non-null object
# contbr_st            1001727 non-null object
# contbr_zip           1001620 non-null object
# contbr_employer       988002 non-null object
# contbr_occupation     993301 non-null object
# contb_receipt_amt    1001731 non-null float64
# contb_receipt_dt     1001731 non-null object
# receipt_desc           14166 non-null object
# memo_cd                92482 non-null object
# memo_text              97770 non-null object
# form_tp              1001731 non-null object
# file_num             1001731 non-null int64
# dtypes: float64(1), int64(1), object(14)
# memory usage: 122.3+ MB
We can use unique to quickly obtain the list of distinct candidates (the cand_nm column):
unique_cands = fec.cand_nm.unique()
print(unique_cands)
# ['Bachmann, Michelle' 'Romney, Mitt' 'Obama, Barack'
# "Roemer, Charles E. 'Buddy' III" 'Pawlenty, Timothy' 'Johnson, Gary Earl'
# 'Paul, Ron' 'Santorum, Rick' 'Cain, Herman' 'Gingrich, Newt'
# 'McCotter, Thaddeus G' 'Huntsman, Jon' 'Perry, Rick']
Notice that the data does not record each candidate's party affiliation, so we add it manually in the form of a dictionary:
parties = {
    'Bachmann, Michelle': "Republican",
    'Cain, Herman': "Republican",
    'Gingrich, Newt': "Republican",
    'Huntsman, Jon': "Republican",
    'Johnson, Gary Earl': "Republican",
    'McCotter, Thaddeus G': "Republican",
    'Obama, Barack': 'Democrat',
    'Paul, Ron': "Republican",
    'Pawlenty, Timothy': "Republican",
    'Perry, Rick': "Republican",
    "Roemer, Charles E. 'Buddy' III": "Republican",
    'Romney, Mitt': "Republican",
    'Santorum, Rick': "Republican"
}
When a dictionary is passed to the map method, each value is looked up automatically by key, for example:
print(fec.cand_nm[123456:123461])
# 123456 Obama, Barack
# 123457 Obama, Barack
# 123458 Obama, Barack
# 123459 Obama, Barack
# 123460 Obama, Barack
# Name: cand_nm, dtype: object
print(fec.cand_nm[123456:123461].map(parties))
# 123456 Democrat
# 123457 Democrat
# 123458 Democrat
# 123459 Democrat
# 123460 Democrat
# Name: cand_nm, dtype: object
Based on this, we can add a new column:
fec["party"] = fec.cand_nm.map(parties)
print(fec["party"].value_counts())
# Democrat 593746
# Republican 407985
# Name: party, dtype: int64
Also note, before starting the analysis, that the data contains both contributions and refunds (negative amounts); to simplify things, we restrict the analysis to contributions only:
print((fec.contb_receipt_amt > 0).value_counts())
# True 991475
# False 10256
# Name: contb_receipt_amt, dtype: int64
fec = fec[fec.contb_receipt_amt > 0]
In addition, since Barack Obama and Mitt Romney were the main candidates, we also prepare a subset containing only their records:
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])]
# isin returns True where the value is contained in the given list, and False otherwise
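A tiny illustration of the boolean mask isin produces, on a made-up Series (not part of the dataset):

s = pd.Series(["Obama, Barack", "Paul, Ron", "Romney, Mitt"])
print(s.isin(["Obama, Barack", "Romney, Mitt"]))
# 0     True
# 1    False
# 2     True
# dtype: bool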
First we can get the total number of contributions for each occupation:
print(fec.contbr_occupation.value_counts().head())
# RETIRED 234829
# INFORMATION REQUESTED 35176
# ATTORNEY 34409
# HOMEMAKER 30199
# PHYSICIAN 23530
# Name: contbr_occupation, dtype: int64
One problem is that the same occupation can be written in several ways; for example, "CEO" may also appear as "C.E.O". Here we map such variants to a single category:
occ_mapping = {
    "INFORMATION REQUESTED PER BEST EFFORT": "NOT PROVIDED",
    "INFORMATION REQUESTED": "NOT PROVIDED",
    "INFORMATION REQUESTED (BEST EFFORT)": "NOT PROVIDED",
    "C.E.O": "CEO"
}
f = lambda x: occ_mapping.get(x, x)
# dict.get looks up the key given as the first argument; if the key is missing it
# returns the second argument (here the original value itself)
fec.contbr_occupation = fec.contbr_occupation.map(f)
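A quick check of how that lambda behaves, on two hypothetical inputs, just to illustrate the dict.get fallback:

print(f("C.E.O"))     # "CEO"       -> mapped to the normalized spelling
print(f("ENGINEER"))  # "ENGINEER"  -> not in occ_mapping, returned unchanged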
The employer column suffers from the same issue, so we apply the same treatment:
emp_mapping = {
    "INFORMATION REQUESTED PER BEST EFFORT": "NOT PROVIDED",
    "INFORMATION REQUESTED": "NOT PROVIDED",
    "SELF": "SELF-EMPLOYED",
    "SELF EMPLOYED": "SELF-EMPLOYED"
}
f = lambda x:emp_mapping.get(x,x)
fec.contbr_employer = fec.contbr_employer.map(f)
Now we can use pivot_table to aggregate the data by party and occupation, keep only the occupations whose total contributions exceed $2 million, and visualize the result:
by_occupation = fec.pivot_table("contb_receipt_amt",
index="contbr_occupation",
columns="party",aggfunc=sum)
over_2mm = by_occupation[by_occupation.sum(1) > 2000000]
print(over_2mm.head())
# party Democrat Republican
# contbr_occupation
# ATTORNEY 11141982.97 7477194.43
# C.E.O. 1690.00 2592983.11
# CEO 2074284.79 1640758.41
# CONSULTANT 2459912.71 2544725.45
# ENGINEER 951525.55 1818373.70
over_2mm.plot(kind="barh")
plt.show()
We can reuse the earlier approach to find the occupations or employers that gave the most to Obama and Romney (e.g. the top 5):
def get_top_amounts(group, key, n=5):
    totals = group.groupby(key)["contb_receipt_amt"].sum()
    return totals.nlargest(n)  # nlargest(n) returns the n largest values

grouped = fec_mrbo.groupby("cand_nm")  # split the data into the two candidates
print(grouped.apply(get_top_amounts, "contbr_occupation", n=5))
# cand_nm        contbr_occupation
# Obama, Barack  RETIRED              25305116.38
#                ATTORNEY             11141982.97
#                NOT PROVIDED          4866973.96
#                HOMEMAKER             4248875.80
#                PHYSICIAN             3735124.94
# Romney, Mitt   RETIRED              11508473.59
#                NOT PROVIDED         11396894.84
#                HOMEMAKER             8147446.22
#                ATTORNEY              5364718.82
#                PRESIDENT             2491244.89
# Name: contb_receipt_amt, dtype: float64
From there, visualization is straightforward:
Obama = grouped.apply(get_top_amounts,"contbr_occupation",n=5)["Obama, Barack"]
print(Obama)
Obama.plot.bar()
plt.show()
We can use the cut function to bucket the contributions by size:
bins = np.array([0, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt, bins)
grouped = fec_mrbo.groupby(["cand_nm", labels])
res = grouped.size().unstack(0)
print(res)
# cand_nm              Obama, Barack  Romney, Mitt
# contb_receipt_amt
# (0, 1]                       493.0          77.0
# (1, 10]                    40070.0        3681.0
# (10, 100]                 372280.0       31853.0
# (100, 1000]               153991.0       43357.0
# (1000, 10000]              22284.0       26186.0
# (10000, 100000]                2.0           1.0
# (100000, 1000000]              3.0           NaN
# (1000000, 10000000]            4.0           NaN
Clearly, Obama received far more small contributions than Romney did. We can also look at each candidate's share of the total amount within each bucket and then visualize it:
bucket_sums = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sums.div(bucket_sums.sum(axis=1), axis=0)
print(normed_sums)
# cand_nm              Obama, Barack  Romney, Mitt
# contb_receipt_amt
# (0, 1]                    0.805182      0.194818
# (1, 10]                   0.918767      0.081233
# (10, 100]                 0.910769      0.089231
# (100, 1000]               0.710176      0.289824
# (1000, 10000]             0.447326      0.552674
# (10000, 100000]           0.823120      0.176880
# (100000, 1000000]         1.000000           NaN
# (1000000, 10000000]       1.000000           NaN

normed_sums[:-2].plot(kind="barh")  # drop the two largest buckets, which only Obama received
plt.show()
We can also aggregate the data by state and candidate to see how much each state contributed to each candidate:
grouped = fec_mrbo.groupby(["cand_nm","contbr_st"])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
# replace NA with 0
totals = totals[totals.sum(1) > 100000]
# keep only states whose total contributions exceed $100,000
print(totals.head())
# cand_nm Obama, Barack Romney, Mitt
# contbr_st
# AK 281840.15 86204.24
# AL 543123.48 527303.51
# AR 359247.28 105556.00
# AZ 1506476.98 1888436.23
# CA 23824984.24 11237636.60
In the same way, we can compute each candidate's share of the contributions within each state:
percent = totals.div(totals.sum(1),axis=0)
print(percent.head())
# cand_nm Obama, Barack Romney, Mitt
# contbr_st
# AK 0.765778 0.234222
# AL 0.507390 0.492610
# AR 0.772902 0.227098
# AZ 0.443745 0.556255
# CA 0.679498 0.320502
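To get a quick visual comparison of these shares, a simple horizontal bar chart of a few states works; the selected state codes below are just an illustrative assumption (they comfortably pass the $100,000 filter above):

# Hypothetical selection of states; each bar pair shows the candidates' shares
percent.loc[["CA", "TX", "NY", "FL", "AZ"]].plot(kind="barh")
plt.show()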