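Data analysis in Python is built mainly on the pandas Series and DataFrame types. They are somewhat like arrays in PHP, and in data science they play the role of matrices.
The data is sampled and prepared for machine-learning computation with pandas, and the results are then plotted with Matplotlib/Seaborn.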
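A few matrix terms used in this context:
A 1 x n or n x 1 matrix is a vector.
A 1 x 1 matrix is a scalar.
If A is an m x p matrix and B is a p x n matrix, the m x n matrix C is the product of A and B, written C = AB, where the element in row i and column j of C is $c_{ij} = \sum_{k=1}^{p} a_{ik} b_{kj}$.
The main pandas data structures are Series (one-dimensional data) and DataFrame (two-dimensional data). These two structures are enough to handle most typical use cases in finance, statistics, social science, engineering, and other fields.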
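As a quick check of the matrix product definition above, the following minimal sketch multiplies two small NumPy arrays and recomputes one element by hand. The matrices A and B are made-up values chosen only for illustration.

```python
import numpy as np

# A is m x p (2 x 3) and B is p x n (3 x 2), so C = AB is m x n (2 x 2).
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])

C = A @ B  # matrix product

# Element (i, j) computed directly from the definition:
# the sum over k of A[i, k] * B[k, j].
i, j = 0, 1
c_ij = sum(A[i, k] * B[k, j] for k in range(A.shape[1]))

print(C)
print(c_ij == C[i, j])
```

The inner dimension p must match (the columns of A and the rows of B), otherwise NumPy raises an error.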
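A Series is an object similar to a one-dimensional array. It consists of a set of data (of any NumPy data type) and an associated set of data labels (its index).
A DataFrame is a tabular data structure containing an ordered set of columns, where each column can hold a different value type (numeric, string, boolean). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share one index.
A pandas Series is like a single column of a table. It resembles a one-dimensional array and can hold any data type.
A Series consists of an index and a column of values. Its constructor is:
pandas.Series(data, index, dtype, name, copy)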
The DataFrame constructor is:
pandas.DataFrame(data, index, columns, dtype, copy)
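To make the two constructors above concrete, here is a minimal sketch that builds one Series and one DataFrame. The labels, names, and values are invented for illustration only.

```python
import pandas as pd

# A Series: values plus an index of labels, with an explicit dtype and name.
s = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype="float64", name="score")

# A DataFrame built from a dict of columns; both columns share the row index.
df = pd.DataFrame(
    {"name": ["Ann", "Bob"], "age": [30, 25]},
    index=["r1", "r2"],
)

# Selecting one column gives back a Series that carries the same row index.
age = df["age"]

print(s)
print(df)
print(type(age))
```

This matches the description of a DataFrame as a dictionary of Series sharing one index.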
Series arithmetic
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
S1 = Series([1,2,3],index=['A','B','C'])
S2 = Series([4,5,6,7],index=['B','C','D','E'])
# Result of S1 + S2: values are aligned by index label, and a label present in only one Series gives NaN
A    NaN
B    6.0
C    8.0
D    NaN
E    NaN
dtype: float64
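If the NaN values produced by index alignment are not wanted, one option is the `add` method with `fill_value`. The sketch below reuses the same two Series. Treating a missing label as 0 is an assumption made for this example, not something the original states.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
s2 = pd.Series([4, 5, 6, 7], index=["B", "C", "D", "E"])

# Plain addition: labels found in only one Series become NaN.
plain = s1 + s2

# add() with fill_value=0 treats a label missing on one side as 0.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```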
DataFrame arithmetic:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df1 = DataFrame(np.arange(4).reshape(2,2),index=['A','B'],columns=['BJ','SH'])
df2 = DataFrame(np.arange(9).reshape(3,3),index=['A','B','C'],columns=['BJ','SH','GZ'])
# Result of df1 + df2 (cells missing from either frame become NaN):
    BJ  GZ   SH
A  0.0 NaN  2.0
B  5.0 NaN  7.0
C  NaN NaN  NaN

df3 = DataFrame([[1,2,3],[4,5,np.nan],[7,8,9]],index=['A','B','C'],columns=['c1','c2','c3'])
# Sum of each column
df3.sum()
# Sum of each row
df3.sum(axis=1)
# Minimum of each column
df3.min()
# Maximum of each column
df3.max()
# Summary statistics of each column
df3.describe()
# Result:
        c1   c2        c3
count  3.0  3.0  2.000000
mean   4.0  5.0  6.000000
std    3.0  3.0  4.242641
min    1.0  2.0  3.000000
25%    2.5  3.5  4.500000
50%    4.0  5.0  6.000000
75%    5.5  6.5  7.500000
max    7.0  8.0  9.000000
A Series can be sorted by value or by index.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
s1 = Series(np.random.randn(10))
# s1.values and s1.index give the values and the index
s1.sort_index()
# ascending=False sorts the values from largest to smallest
s1.sort_values(ascending=False)
Load a CSV file and sort it:
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
csv_user = '/Users/zhangyu/Desktop/user.csv'
pd.read_csv(csv_user,low_memory=False)[["user_id","user_name","user_status","user_nickname","user_display"]].sort_values('user_display',ascending=False)
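Because the CSV file above lives on the author's machine, the following sketch reproduces the same read, select, and sort steps on a small in-memory CSV. The rows are hypothetical, and only three of the column names from the original are used.

```python
import io
import pandas as pd

# Hypothetical stand-in for user.csv; the real file is not available here.
csv_text = """user_id,user_name,user_display
1,alice,5
2,bob,9
3,carol,7
"""

df = pd.read_csv(io.StringIO(csv_text))

# Select a subset of columns, then sort by user_display from high to low.
top = df[["user_id", "user_name", "user_display"]].sort_values(
    "user_display", ascending=False
)

print(top)
```

`sort_values` returns a new DataFrame and leaves `df` unchanged.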
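# Three ways to rename the index of a DataFrame
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df1 = DataFrame(np.arange(9).reshape(3,3),index=['BJ','SH','SZ'],columns=['A','B','C'])
# Method 1: assign a new index directly
df1.index = Series(['bj','sh','sz'])
# Method 2: map a function over the existing index
df1.index = df1.index.map(str.upper)
# Method 3: rename() with mappings from old names to new names (returns a new DataFrame)
df1.rename(index={},columns={})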
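The third method is shown above with empty mappings. A minimal sketch with actual mappings might look like this. The new names "Beijing" and "a" are made up for the example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    index=["BJ", "SH", "SZ"],
    columns=["A", "B", "C"],
)

# rename() maps old labels to new ones; labels not in the mapping are kept.
renamed = df1.rename(index={"BJ": "Beijing"}, columns={"A": "a"})

print(renamed)
print(df1.index.tolist())  # the original frame is not modified
```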
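pd.merge joins two DataFrames on the columns they share (here, key). By default only rows whose key values appear in both frames are kept, so the result is empty when no key values match. If the frames share no column name at all, pd.merge raises a MergeError.
import pandas as pd
import numpy as np
from pandas import Series, DataFrame
df1 = DataFrame({'key':['X','Y','Z'],'data_set_1':[1,2,3]})
df2 = DataFrame({'key':['X','B','C'],'data_set_2':[4,5,6]})
pd.merge(df1,df2)
# Result:
  key  data_set_1  data_set_2
0   X           1           4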
If a key value appears more than once, every occurrence is matched, so it appears multiple times in the result:
df3 = DataFrame({'key':['X','Y','X'],'data_set_1':[1,2,3]})
df4 = DataFrame({'key':['X','B','C'],'data_set_2':[4,5,6]})
pd.merge(df3,df4)
# Result:
  key  data_set_1  data_set_2
0   X           1           4
1   X           3           4
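Two parameters of merge deserve special attention. on specifies which column to join on, and how specifies which side's rows are kept and filled in (inner, left, right, outer), much like a JOIN in MySQL.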
pd.merge(df1,df2,on='key',how="outer")
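To compare the `how` options side by side, the sketch below runs an inner, a left, and an outer merge on the same two frames. The `indicator=True` argument is an addition not used in the original; it adds a `_merge` column showing where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["X", "Y", "Z"], "data_set_1": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["X", "B", "C"], "data_set_2": [4, 5, 6]})

# inner (the default): only keys present in both frames.
inner = pd.merge(df1, df2, on="key", how="inner")

# left: every row of df1, with NaN where df2 has no match.
left = pd.merge(df1, df2, on="key", how="left")

# outer: keys from both frames; _merge records the source of each row.
outer = pd.merge(df1, df2, on="key", how="outer", indicator=True)

print(inner)
print(left)
print(outer)
```

This mirrors INNER JOIN, LEFT JOIN, and a full outer join in SQL terms.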
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

arr1 = np.arange(9).reshape(3,3)
arr2 = np.arange(9).reshape(3,3)
np.concatenate([arr1,arr2])
# Result:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8],
       [0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
np.concatenate([arr1,arr2],axis=1)
# Result:
array([[0, 1, 2, 0, 1, 2],
       [3, 4, 5, 3, 4, 5],
       [6, 7, 8, 6, 7, 8]])

s1 = Series([1,2,3],index=['X','Y','Z'])
s2 = Series([4,5],index=['A','B'])
np.concatenate((s1,s2))
# Result:
array([1, 2, 3, 4, 5])
pd.concat([s1,s2])
pd.concat([s1,s2],axis=1)

df1 = DataFrame(np.random.randn(4,3),columns=['X','Y','Z'])
df2 = DataFrame(np.random.randn(3,3),columns=['X','Y','A'])
pd.concat([df1,df2])
# Result (the values are random and differ on every run):
          X         Y         Z         A
0  1.988333  0.983833 -1.927185       NaN
1  0.958006 -0.927374 -0.294826       NaN
2  1.748977 -0.162052 -0.975951       NaN
3 -1.476630  0.985365  0.128454       NaN
0 -0.717199  0.371737       NaN  0.957216
1  1.149086  1.761177       NaN -0.601090
2 -2.731751  0.321025       NaN -0.186762
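In the concat result above the row labels repeat (0 to 3, then 0 to 2) and unmatched columns are filled with NaN. As a hedged sketch of two options that the original does not use, `ignore_index=True` renumbers the rows and `join="inner"` keeps only the columns common to both frames. The small frames here are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"X": [1, 2], "Y": [3, 4]})
df2 = pd.DataFrame({"X": [5], "A": [6]})

# Stack rows and renumber the index 0..n-1; missing cells become NaN.
stacked = pd.concat([df1, df2], ignore_index=True)

# Keep only the columns that exist in every frame.
common = pd.concat([df1, df2], join="inner", ignore_index=True)

print(stacked)
print(common)
```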
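Using combine_first
df1.combine_first(df2) fills the missing values of df1 with the corresponding values from df2.
import numpy as np
from pandas import Series, DataFrame

s1 = Series([2,np.nan,4,np.nan],index=['A','B','C','D'])
s2 = Series([1,2,3,4],index=['A','B','C','D'])
s1.combine_first(s2)
# Result:
A    2.0
B    2.0
C    4.0
D    4.0
dtype: float64

df1 = DataFrame({
    'x': [1,np.nan,3,np.nan],
    'y': [5,np.nan,7,np.nan],
    'z': [9,np.nan,11,np.nan],
})
# Column names are case-sensitive: 'z' has to match df1's 'z' for its gaps to be filled
df2 = DataFrame({
    'z': [np.nan,10,np.nan,12],
    'A': [1,2,3,4]
})
df1.combine_first(df2)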
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('/Users/zhangyu/Desktop/user.csv',low_memory=False)
# Keep only the part of each value before the first '/'
def foo(line):
    if line is not None:
        items = str(line).strip().split('/')
        return Series([items[0]])
# Apply foo to every value of the user_head column
df["user_head"].apply(foo)
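Combining len with unique counts the number of distinct values in a column. The keep parameter of drop_duplicates controls which duplicate survives: keep='last' keeps the last of the duplicated rows, while the default keeps the first.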
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('/Users/zhangyu/Desktop/user.csv',low_memory=False)
fields = ['user_id','user_name','user_status','user_display']
df.head(10)[fields]
# Number of distinct values
len(df['user_status'].unique())
# Boolean mask: True for every repeat of an earlier value
df['user_display'].duplicated()
# Drop the repeats, keeping the first occurrence
df['user_display'].drop_duplicates()
# Drop the repeats, keeping the last occurrence
df.head(30)[fields].drop_duplicates(['user_display'],keep="last")
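The effect of `keep` is easier to see on a tiny table. The sketch below uses four made-up rows in place of user.csv, with the value "a" appearing twice.

```python
import pandas as pd

# Hypothetical data: user_display "a" occurs for user_id 1 and 3.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "user_display": ["a", "b", "a", "c"],
})

n_unique = len(df["user_display"].unique())   # number of distinct values
dup_mask = df["user_display"].duplicated()    # True for repeats of earlier values

first = df.drop_duplicates(["user_display"])               # keeps user_id 1 for "a"
last = df.drop_duplicates(["user_display"], keep="last")   # keeps user_id 3 for "a"

print(n_unique)
print(dup_mask.tolist())
print(first["user_id"].tolist())
print(last["user_id"].tolist())
```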
With a date index, s1['2016-09'] returns all values of the Series that fall in that month.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from datetime import datetime

date_list = [
    datetime(2016,9,1),
    datetime(2016,9,10),
    datetime(2017,9,1),
    datetime(2017,9,20),
    datetime(2017,10,1)
]
s1 = Series(np.random.rand(5),index=date_list)
# All values in September 2016
s1['2016-09']
# 100 consecutive days starting on 2016-01-01
days = pd.date_range('2016-01-01',periods=100)
s2 = Series(np.random.rand(100),index=days)
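The date-string lookup described above can be sketched with fixed values so the selections are easy to verify. This version uses `.loc`, the explicit label-based form of the same lookup; the integer values 0 to 4 replace the random numbers only for readability.

```python
from datetime import datetime

import numpy as np
import pandas as pd

date_list = [
    datetime(2016, 9, 1),
    datetime(2016, 9, 10),
    datetime(2017, 9, 1),
    datetime(2017, 9, 20),
    datetime(2017, 10, 1),
]
s1 = pd.Series(np.arange(5), index=date_list)

sept_2016 = s1.loc["2016-09"]               # every entry in September 2016
year_2017 = s1.loc["2017"]                  # every entry in 2017
span = s1.loc["2016-09-10":"2017-09-01"]    # date slices include both ends

print(sept_2016)
print(year_2017)
print(span)
```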
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib.pyplot as plt

t_range = pd.date_range('2016-01-01','2016-12-31')
s1 = Series(np.random.randn(len(t_range)),index=t_range)
# Monthly mean (recent pandas versions prefer the alias 'ME' over 'M')
s1_month = s1.resample('M').mean()

# Hourly data for plotting (recent pandas versions prefer 'h' over 'H')
t_range = pd.date_range('2016-01-01','2016-12-31',freq='H')
stock_df = DataFrame(index=t_range)
stock_df['alibaba'] = np.random.randint(80,160,size=len(t_range))
stock_df['baidu'] = np.random.randint(30,60,size=len(t_range))
stock_df.plot()

# Resample to weekly means
weekly_df = DataFrame()
weekly_df['alibaba'] = stock_df['alibaba'].resample('W').mean()
weekly_df['baidu'] = stock_df['baidu'].resample('W').mean()
weekly_df.plot()
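To see what resample actually computes, here is a minimal sketch on 14 days of made-up data with known values. With the 'W' rule each bin ends on a Sunday, and 2016-01-01 was a Friday, so the first bin holds only three days.

```python
import pandas as pd

# 14 daily values 0..13 starting on Friday 2016-01-01 (illustrative data).
idx = pd.date_range("2016-01-01", periods=14, freq="D")
s = pd.Series(range(14), index=idx)

# Weekly mean; each bin is labelled with the Sunday that ends the week.
weekly = s.resample("W").mean()

print(weekly)
```

The three bins cover Jan 1 to 3, Jan 4 to 10, and Jan 11 to 14 (labelled Jan 17).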
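import random
import string
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
# Cut the scores into bins
score_list = np.random.randint(25,100,size=20)
bins = [0,59,75,90,100]
score_cat = pd.cut(score_list,bins)
df = DataFrame()
df['score'] = score_list
# Random 3-letter student names; pandas.util.testing.rands is deprecated and missing from recent pandas versions, so the standard library is used instead
df['student'] = [''.join(random.choices(string.ascii_letters,k=3)) for i in range(20)]
df['section'] = pd.cut(df['score'],bins,labels=['Low','Pass','Good','Great'])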
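Because the scores above are random, the bin boundaries are hard to check by eye. This sketch feeds `pd.cut` a fixed list of hand-picked scores so the edge behavior is visible: by default each bin includes its right edge and excludes its left edge.

```python
import pandas as pd

scores = [30, 59, 60, 75, 76, 90, 100]   # hand-picked boundary values
bins = [0, 59, 75, 90, 100]

# Bins are (0, 59], (59, 75], (75, 90], (90, 100].
section = pd.cut(scores, bins, labels=["Low", "Pass", "Good", "Great"])

print(list(section))
```

A score of exactly 59 therefore lands in "Low", and 60 is the first "Pass".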
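Common ways to work with g, the object returned by groupby:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = pd.read_csv('/Users/zhangyu/Desktop/type.csv',low_memory=False)
g = df.groupby(df['type_topid'])
# Evaluating g shows a DataFrameGroupBy object:
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc53d997520>
# Convert it to a list of (name, group) pairs, or iterate with a for loop
list(g)
for name, group_df in g:
    print(name)
    print(group_df)
# Aggregate each group; agg needs a function, 'count' is used here as an example
g.agg('count')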
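As a small illustration of aggregating groups, the sketch below groups a made-up table by `type_topid` and computes several statistics at once. The `hits` column is hypothetical and does not come from type.csv.

```python
import pandas as pd

# Hypothetical data standing in for type.csv.
df = pd.DataFrame({
    "type_topid": [1, 1, 2, 2, 2],
    "hits": [10, 20, 5, 5, 20],
})

g = df.groupby("type_topid")

sizes = g.size()                                    # rows per group
summary = g["hits"].agg(["count", "sum", "mean"])   # several aggregates at once

print(sizes)
print(summary)
```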
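To present data more clearly, the original table structure is temporarily reshaped, and the summarized values change along with that reshaping. This is the idea behind a pivot table.
The pivot_table function builds a pivot table.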
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
df = pd.read_csv('/Users/zhangyu/Desktop/type.csv',low_memory=False)
pd.pivot_table(df,index=['type_id','type_name'])
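The call above passes only `index`, which averages the numeric columns per group. A fuller sketch, on invented sales data, also sets `columns`, `values`, and `aggfunc` so that rows and columns both carry categories.

```python
import pandas as pd

# Hypothetical sales records.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "city": ["BJ", "SH", "BJ", "SH", "BJ"],
    "sales": [10, 20, 30, 40, 50],
})

# One row per month, one column per city, each cell the summed sales.
table = pd.pivot_table(
    df, index="month", columns="city", values="sales", aggfunc="sum"
)

print(table)
```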
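Why plot with Python?
What is Matplotlib?
plot takes x-axis and y-axis data, and x and y must contain the same number of elements. A single plot call can also draw two lines in one figure.
import numpy as np
import matplotlib.pyplot as plt

x = [1,2,3]
y = [4,5,6]
plt.plot(x,y)
c = [10,8,6]
d = [1,8,3]
# Two lines in one call: (x, y) and (c, d)
plt.plot(x,y,c,d)
# t and s are not defined in the original; these values are illustrative
t = np.arange(0.0,2.0,0.1)
s = np.sin(t*np.pi)
plt.plot(t,s,'r--',label='demo_aaa')
plt.plot(t*2,s,'b--',label='demo_bbb')
plt.xlabel('this is x')
plt.ylabel('this is y')
plt.title('this is a demo')
plt.legend()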
subplot works by dividing one figure into a grid of sub-plots.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0,5.0)
y1 = np.sin(np.pi*x)
y2 = np.sin(np.pi*x*2)
# Everything in one figure
plt.plot(x,y1,'b--',label='sin(pi*x)')
plt.ylabel('y1 value')
plt.plot(x,y2,'r--',label='sin(pi*2x)')
plt.ylabel('y2 value')
plt.xlabel('x value')
plt.title('this is x-y value')
plt.legend()
# Build a 2 x 2 grid with subplot
plt.subplot(2,2,1)
plt.plot(x,y1,'b--')
plt.ylabel('y1')
plt.subplot(2,2,2)
plt.plot(x,y2,'r--')
plt.ylabel('y2')
plt.xlabel('x')
plt.subplot(2,2,3)
plt.plot(x,y1,'r*')
plt.subplot(2,2,4)
plt.plot(x,y1)
# Create the figure and all of its axes in one call
figure, ax = plt.subplots(2,2)
ax[0][0].plot(x,y1)
ax[0][1].plot(x,y2)
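As a compact, self-contained variant of the `plt.subplots` approach, the sketch below draws the two sine curves in two stacked axes. The Agg backend and the `sharex=True` option are choices made for this example, so that it runs without a display and both plots share one x axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 5.0)

# Two rows, one column; axes is an array holding one Axes object per cell.
fig, axes = plt.subplots(2, 1, sharex=True)
axes[0].plot(x, np.sin(np.pi * x), "b--", label="sin(pi*x)")
axes[1].plot(x, np.sin(2 * np.pi * x), "r--", label="sin(2*pi*x)")

for ax in axes:
    ax.legend()  # labels only appear once legend() is called
axes[1].set_xlabel("x")

# To view the result, call plt.show() with an interactive backend
# or save it with fig.savefig(<path>).
```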
A Series can be plotted directly with s1.plot(). Among the commonly used parameters, label is displayed only when it is combined with plt.legend().
import numpy as np
import pandas as pd
from pandas import Series
import matplotlib.pyplot as plt
s1 = Series(np.random.randn(1000)).cumsum()
s2 = Series(np.random.randn(1000)).cumsum()
s1.plot(kind='line',grid=True,label="s1",title="this is Series")
s2.plot(kind='line',grid=True,label="s2",title="this is Series")
plt.legend()
figure , ax = plt.subplots(2,1)
ax[0].plot(s1)
ax[1].plot(s2)
s1.plot(ax=ax[0],label='S1')
s2.plot(ax=ax[1],label='S2')
Use df.plot to plot a DataFrame. The example below uses the kind and stacked parameters:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
df = DataFrame(
np.random.randint(1,10,40).reshape(10,4),
columns=['A','B','C','D']
)
df.plot(kind="bar")
df.plot(kind="bar",stacked=True)
Use the plt.hist function to draw a histogram. The example below uses the rwidth, bins, and color parameters:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
s = Series(np.random.randn(1000))
plt.hist(s,rwidth=0.9,bins=100,color='r')
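The `bins` argument decides how the value range is divided. As a sketch of the counting step without any drawing, `np.histogram` can be applied to a few hand-picked numbers. The data here is invented so the counts can be checked by hand.

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 9])

# bins=4 splits the range [1, 9] into four equal-width bins.
counts, edges = np.histogram(data, bins=4)

print(edges)   # bin boundaries
print(counts)  # number of values falling into each bin
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.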
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import seaborn as sns
s1 = Series(np.random.randn(1000))
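# Histogram with a KDE curve and rug marks (newer seaborn versions deprecate distplot in favor of histplot/displot)
sns.distplot(s1,bins=30,hist=True,kde=True,rug=True)
# Filled KDE curve (newer seaborn versions use fill=True instead of shade=True)
sns.kdeplot(s1,shade=True,color='#00FFFF')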
The sample data can be downloaded from https://github.com/mwaskom/seaborn-data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series,DataFrame
import seaborn as sns
df = pd.read_csv('/Users/zhangyu/Downloads/seaborn-data-master/flights.csv')
# Build a pivot table with one column per year
df = df.pivot(index='month',columns='year',values='passengers')
sns.heatmap(df,annot=True,fmt='d')
# Draw a bar chart of the yearly totals with seaborn
s = df.sum()
sns.barplot(x=s.index,y=s.values)
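import seaborn as sns
import matplotlib.pyplot as plt
# x, y1 and y2 are assumed to be defined as in the subplot example earlier
# Available styles
style = ['darkgrid','dark','white','whitegrid','ticks']
# Set the style before plotting so that it takes effect
sns.set_style(style[3])
plt.plot(x,y1)
plt.plot(x,y2)
# Show the parameters of the current style
sns.axes_style()
# Reset to the seaborn defaults
sns.set()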