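The content of this article comes from the book Python for Data Analysis (《利用python进行数据分析》) and is collected here only for convenient everyday reference. Please report any errors you find.
reindex creates a new object conformed to a new index. If an index value is not already present, a missing value (NaN) is introduced for it; the fill_value option supplies a different value for those new labels.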
- >>> from pandas import DataFrame,Series
- >>> obj = Series([4.5,9.3,-8.4,6.6],index = ['d','b','a','c'])
- >>> obj
- d 4.5
- b 9.3
- a -8.4
- c 6.6
- dtype: float64
- >>> obj2 = obj.reindex(['a','b','c','d','e'])
- >>> obj2
- a -8.4
- b 9.3
- c 6.6
- d 4.5
- e NaN
- dtype: float64
- >>> obj2 = obj.reindex(['a','b','c','d','e'],fill_value=0)
- >>> obj2
- a -8.4
- b 9.3
- c 6.6
- d 4.5
- e 0.0
- dtype: float64
For ordered data, the method option can fill in values while reindexing; method='ffill' forward-fills:
- >>> obj3 = Series(['blue','purple','yellow'],index = [0,2,4])
- >>> obj3.reindex(range(6),method='ffill')
- 0 blue
- 1 blue
- 2 purple
- 3 purple
- 4 yellow
- 5 yellow
- dtype: object
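Dropping one or more entries from an axis is easy if you have an index array or list. Because that can require some data munging and set logic, the drop method returns a new object with the specified values deleted from the specified axis: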
- >>> from pandas import DataFrame,Series
- >>> import numpy as np
- >>> obj = Series(np.arange(5.),index=['a','b','c','d','e'])
- >>> new_obj = obj.drop('c')
- >>> new_obj
- a 0.0
- b 1.0
- d 3.0
- e 4.0
- dtype: float64
- >>> obj.drop(['d','c'])
- a 0.0
- b 1.0
- e 4.0
- dtype: float64
-
- >>> data = DataFrame(np.arange(16).reshape((4,4)),index=['A','B','C','D'],columns=['one','two','three','four'])
- >>> data
- one two three four
- A 0 1 2 3
- B 4 5 6 7
- C 8 9 10 11
- D 12 13 14 15
- >>> data.drop(['B','D'])
- one two three four
- A 0 1 2 3
- C 8 9 10 11
- >>> data.drop(['one','three'],axis=1)
- two four
- A 1 3
- B 5 7
- C 9 11
- D 13 15
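The point that drop returns a new object rather than modifying the original can be checked directly. The following is a minimal sketch; the variable names are illustrative, and the columns= keyword is assumed to be available (it exists in reasonably recent pandas releases, not necessarily the one used for the session above).

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')

# drop returned a new Series; the original still has all five labels
original_labels = list(obj.index)
remaining_labels = list(new_obj.index)

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['A', 'B', 'C', 'D'],
                    columns=['one', 'two', 'three', 'four'])

# axis=1 and the columns= keyword both remove columns
by_axis = data.drop(['one', 'three'], axis=1)
by_keyword = data.drop(columns=['one', 'three'])
same_result = by_axis.equals(by_keyword)

print(original_labels, remaining_labels, same_result)
```

To change the original object, assign the result back (for example obj = obj.drop('c')).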
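Series indexing (obj[...]) works much like NumPy array indexing, except that you can use the Series's index values instead of only integers:
- >>> obj = Series(np.arange(4.),index=['a','b','c','d'])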
- >>> obj['b']
- 1.0
- >>> obj[1]
- 1.0
- >>> obj[2:4]
- c 2.0
- d 3.0
- dtype: float64
- >>> obj[['b','c']]
- b 1.0
- c 2.0
- dtype: float64
- >>> obj[obj>2]
- d 3.0
- dtype: float64
Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive. Indexing into a DataFrame with [] retrieves one or more columns:
- >>> data = DataFrame(np.arange(16).reshape((4,4)),index=['Ohio','Colorado','Utah','New York'],columns=['one','two','three','four'])
- >>> data
- one two three four
- Ohio 0 1 2 3
- Colorado 4 5 6 7
- Utah 8 9 10 11
- New York 12 13 14 15
- >>> data['two']
- Ohio 1
- Colorado 5
- Utah 9
- New York 13
- Name: two, dtype: int32
- >>> data[['two','three']]
- two three
- Ohio 1 2
- Colorado 5 6
- Utah 9 10
- New York 13 14
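Returning to the remark that label slices include their endpoint, the difference from positional slicing can be sketched as follows. This is only an illustration; the variable names are made up, and iloc is used for the positional slice so that the two cases cannot be confused.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints 'b' and 'c' are part of the result
by_label = obj['b':'c']

# Positional slice: stops before position 2, like a Python list
by_position = obj.iloc[1:2]

label_result = list(by_label.index)
position_result = list(by_position.index)
print(label_result, position_result)
```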
Indexing a DataFrame with [] has a few special cases. First, a slice or a boolean array selects rows:
- >>> data[:2]
- one two three four
- Ohio 0 1 2 3
- Colorado 4 5 6 7
- >>> data[data['three']>5]
- one two three four
- Colorado 4 5 6 7
- Utah 8 9 10 11
- New York 12 13 14 15
You can also index with a boolean DataFrame, such as the one produced by a scalar comparison:
- >>> data < 5
- one two three four
- Ohio True True True True
- Colorado True False False False
- Utah False False False False
- New York False False False False
- >>> data[data < 5] = 0
- >>> data
- one two three four
- Ohio 0 0 0 0
- Colorado 0 5 6 7
- Utah 8 9 10 11
- New York 12 13 14 15
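The assignment above modifies data in place. If the original should be kept, one possible alternative is DataFrame.where, which returns a new object. The sketch below is an assumption about how one might write it, not something taken from the book.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# where keeps values for which the condition is True and uses 0 elsewhere
masked = data.where(data >= 5, 0)

# The in-place version from the session above, applied to a copy
in_place = data.copy()
in_place[in_place < 5] = 0

same_values = bool((masked.values == in_place.values).all())
original_untouched = int(data.loc['Ohio', 'two']) == 1
print(same_values, original_untouched)
```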
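For label-based indexing on a DataFrame, the book introduces the special indexing field ix, which selects a subset of rows and columns using NumPy-like notation plus axis labels. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, where loc (label-based) and iloc (position-based) take its place, so the session below needs an older pandas: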
- >>> import pandas as pd
- >>> data = DataFrame(np.arange(16).reshape((4,4)),index=['A','B','C','D'],columns=['one','two','three','four'])
- >>> data.ix['A',['one','three']]
- one 0
- three 2
- Name: A, dtype: int32
- >>> data.ix[['A','C'],[3,0,1]]
- four one two
- A 3 0 1
- C 11 8 9
- >>> data.ix[2]
- one 8
- two 9
- three 10
- four 11
- Name: C, dtype: int32
- >>> data.ix[:'C','four']
- A 3
- B 7
- C 11
- Name: four, dtype: int32
- >>> data.ix[data.three > 5,:3]
- one two three
- B 4 5 6
- C 8 9 10
- D 12 13 14
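On pandas versions without ix, the same selections can be written with loc and iloc. The following is a sketch of one possible translation of the five calls above; the variable names are illustrative, and where ix mixed labels with integer positions the positions are first turned into labels through data.columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['A', 'B', 'C', 'D'],
                    columns=['one', 'two', 'three', 'four'])

# data.ix['A', ['one', 'three']]: labels on both axes
row_a = data.loc['A', ['one', 'three']]

# data.ix[['A', 'C'], [3, 0, 1]]: row labels plus column positions
picked = data.loc[['A', 'C'], data.columns[[3, 0, 1]]]

# data.ix[2]: one row by integer position
third_row = data.iloc[2]

# data.ix[:'C', 'four']: label slice, endpoint included
up_to_c = data.loc[:'C', 'four']

# data.ix[data.three > 5, :3]: boolean row filter, first three columns
filtered = data.loc[data['three'] > 5, data.columns[:3]]

print(picked)
print(filtered)
```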
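One of the most important pandas features is arithmetic between objects with different indexes. When objects are added, if any index pairs are not the same, the index of the result is the union of the index pairs.
This automatic alignment introduces NA values at the labels that do not overlap, and missing values then propagate through arithmetic computations.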
- >>> s1 = Series([2.1,3.1,4.1,5.1,6.1],index=['a','b','c','d','e'])
- >>> s2 = Series([-2.3,4.5,5.6,7.8],index=['a','c','d','e'])
- >>> s1,s2
- (a 2.1
- b 3.1
- c 4.1
- d 5.1
- e 6.1
- dtype: float64, a -2.3
- c 4.5
- d 5.6
- e 7.8
- dtype: float64)
- >>> s1+s2
- a -0.2
- b NaN
- c 8.6
- d 10.7
- e 13.9
- dtype: float64
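The two claims above (the result index is the union, and non-overlapping labels become NaN) can be checked with a short sketch. The variable names here are illustrative only.

```python
import pandas as pd

s1 = pd.Series([2.1, 3.1, 4.1, 5.1, 6.1], index=['a', 'b', 'c', 'd', 'e'])
s2 = pd.Series([-2.3, 4.5, 5.6, 7.8], index=['a', 'c', 'd', 'e'])

total = s1 + s2

# The result is indexed by the union of both indexes
index_is_union = total.index.equals(s1.index.union(s2.index))

# Labels present in only one operand end up as NaN
missing_labels = list(total[total.isna()].index)

print(index_is_union, missing_labels)
```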
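For DataFrame, alignment is performed on both the rows and the columns: adding two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two originals.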
- >>> def1 = DataFrame(np.arange(9.).reshape((3,3)),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
- >>> def2 = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
- >>> def1+def2
- b c d e
- Colorado NaN NaN NaN NaN
- Ohio 3.0 NaN 6.0 NaN
- Oregon NaN NaN NaN NaN
- Texas 9.0 NaN 12.0 NaN
- Utah NaN NaN NaN NaN
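In arithmetic between differently indexed objects, positions that do not overlap produce NaN. The add() method lets you specify a fill value to use instead, and reindex accepts a fill value in the same way: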
- >>> df1 = DataFrame(np.arange(12.).reshape((3,4)),columns=list('abcd'))
- >>> df2 = DataFrame(np.arange(20.).reshape((4,5)),columns=list('abcde'))
- >>> df1
- a b c d
- 0 0.0 1.0 2.0 3.0
- 1 4.0 5.0 6.0 7.0
- 2 8.0 9.0 10.0 11.0
- >>> df2
- a b c d e
- 0 0.0 1.0 2.0 3.0 4.0
- 1 5.0 6.0 7.0 8.0 9.0
- 2 10.0 11.0 12.0 13.0 14.0
- 3 15.0 16.0 17.0 18.0 19.0
- >>> df1+df2
- a b c d e
- 0 0.0 2.0 4.0 6.0 NaN
- 1 9.0 11.0 13.0 15.0 NaN
- 2 18.0 20.0 22.0 24.0 NaN
- 3 NaN NaN NaN NaN NaN
- >>> df1.reindex(columns=df2.columns,fill_value=0)
- a b c d e
- 0 0.0 1.0 2.0 3.0 0
- 1 4.0 5.0 6.0 7.0 0
- 2 8.0 9.0 10.0 11.0 0
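The session above shows df1+df2 and reindex, but not the add() call with a fill value. A minimal sketch of that call, using the same two frames, might look like this (the name filled is illustrative):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Where only one operand has a value, the missing side is treated as 0
filled = df1.add(df2, fill_value=0)

has_nan = bool(filled.isna().any().any())
print(filled)
print(has_nan)
```

Here every position exists in at least one of the two frames, so no NaN remains. A position missing from both operands would still come out as NaN.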
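Arithmetic between a DataFrame and a Series is well defined. As a motivating example, consider the difference between a two-dimensional array and one of its rows.
The subtraction is carried out once for each row, which is called broadcasting; arithmetic between a DataFrame and a Series works in a similar way.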
- >>> arr = np.arange(12.).reshape(3,4)
- >>> arr
- array([[ 0., 1., 2., 3.],
- [ 4., 5., 6., 7.],
- [ 8., 9., 10., 11.]])
- >>> arr[2]
- array([ 8., 9., 10., 11.])
- >>> arr-arr[0]
- array([[0., 0., 0., 0.],
- [4., 4., 4., 4.],
- [8., 8., 8., 8.]])
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.
- >>> from pandas import DataFrame,Series
- >>> import numpy as np
- >>> frame = DataFrame(np.arange(12.).reshape((4,3)),columns = list('bde'),index=['Utah','Ohio','Texas','OreGon'])
- >>> frame
- b d e
- Utah 0.0 1.0 2.0
- Ohio 3.0 4.0 5.0
- Texas 6.0 7.0 8.0
- OreGon 9.0 10.0 11.0
- >>> series = frame.iloc[0]
- >>> series
- b 0.0
- d 1.0
- e 2.0
- Name: Utah, dtype: float64
- >>> frame-series
- b d e
- Utah 0.0 0.0 0.0
- Ohio 3.0 3.0 3.0
- Texas 6.0 6.0 6.0
- OreGon 9.0 9.0 9.0
If an index value is found in neither the DataFrame's columns nor the Series's index, the two objects are reindexed to form the union.
- >>> series2 = Series(range(3),index=['b','e','f'])
- >>> frame+series2
- b d e f
- Utah 0.0 NaN 3.0 NaN
- Ohio 3.0 NaN 6.0 NaN
- Texas 6.0 NaN 9.0 NaN
- OreGon 9.0 NaN 12.0 NaN
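If you want to match on the rows and broadcast over the columns instead, you have to use one of the arithmetic methods. The axis number you pass is the axis to match on; in this example the goal is to match on the DataFrame's row index and broadcast across the columns.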
- >>> series3 = frame['d']
- >>> frame
- b d e
- Utah 0.0 1.0 2.0
- Ohio 3.0 4.0 5.0
- Texas 6.0 7.0 8.0
- OreGon 9.0 10.0 11.0
- >>> series3
- Utah 1.0
- Ohio 4.0
- Texas 7.0
- OreGon 10.0
- Name: d, dtype: float64
- >>> frame.sub(series3,axis=0)
- b d e
- Utah -1.0 0.0 1.0
- Ohio -1.0 0.0 1.0
- Texas -1.0 0.0 1.0
- OreGon -1.0 0.0 1.0
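To see why the method form is needed here, it helps to compare it with the plain operator. The sketch below uses the same frame; naive, by_rows, and by_rows_named are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'OreGon'])
series3 = frame['d']

# The operator matches series3's row labels against the column labels,
# finds no overlap, and yields the union of both label sets filled with NaN
naive = frame - series3
all_missing = bool(naive.isna().all().all())

# Matching on the row index instead; axis='index' is another spelling of axis=0
by_rows = frame.sub(series3, axis=0)
by_rows_named = frame.sub(series3, axis='index')

print(naive.shape, all_missing)
print(by_rows)
```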
NumPy's element-wise array methods (ufuncs) also work on pandas objects:
- >>> frame = DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
- >>> frame
- b d e
- Utah 0.225392 0.944038 -0.286161
- Ohio -0.075078 -1.416288 -1.681523
- Texas 1.674864 2.292591 0.433947
- Oregon 0.525176 1.926218 -0.891167
- >>> np.abs(frame)
- b d e
- Utah 0.225392 0.944038 0.286161
- Ohio 0.075078 1.416288 1.681523
- Texas 1.674864 2.292591 0.433947
- Oregon 0.525176 1.926218 0.891167
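Another common operation is applying a function to the one-dimensional arrays formed by each column or each row. DataFrame's apply method does exactly this: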
- >>> f = lambda x: x.max()-x.min()
- >>> frame.apply(f)
- b 1.749942
- d 3.708879
- e 2.115470
- dtype: float64
- >>> frame.apply(f,axis=1)
- Utah 1.230200
- Ohio 1.606445
- Texas 1.858644
- Oregon 2.817385
- dtype: float64
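Two follow-up observations on apply can be sketched with the values printed above: the same spread can be computed without apply, and the applied function may return a Series, in which case the result is a DataFrame. The function name min_max and the other variable names are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Values copied from the frame printed above
frame = pd.DataFrame(
    [[0.225392, 0.944038, -0.286161],
     [-0.075078, -1.416288, -1.681523],
     [1.674864, 2.292591, 0.433947],
     [0.525176, 1.926218, -0.891167]],
    columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()
via_apply = frame.apply(f)

# The same reduction written with built-in DataFrame methods
direct = frame.max() - frame.min()

def min_max(x):
    # Returning a Series gives one output row per returned label
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)
print(via_apply)
print(summary)
```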
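Sorting a dataset by some criterion is another important built-in operation. To sort by row or column labels, use the sort_index method, which returns a new, sorted object.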
- >>> from pandas import DataFrame,Series
- >>> import numpy as np
- >>> obj = Series(range(4),index = ['d','a','b','c'])
- >>> obj.sort_index()
- a 1
- b 2
- c 3
- d 0
- dtype: int64
With a DataFrame, you can sort by the index on either axis. Data is sorted in ascending order by default, but it can be sorted in descending order too:
- >>> frame = DataFrame(np.arange(8).reshape(2,4),index=['three','one'],columns=['d','a','b','c'])
- >>> frame.sort_index()
- d a b c
- one 4 5 6 7
- three 0 1 2 3
- >>> frame.sort_index(axis=1)
- a b c d
- three 1 2 3 0
- one 5 6 7 4
- >>> frame.sort_index(axis=0)
- d a b c
- one 4 5 6 7
- three 0 1 2 3
- >>> frame.sort_index(axis=1,ascending=False)
- d c b a
- three 0 3 2 1
- one 4 7 6 5
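To sort a Series by its values, use its sort_values method (the book's older obj.order() spelling was removed from later pandas releases):
- >>> obj = Series([4,7,-3,2])
- >>> obj.sort_values()
- 2 -3
- 3 2
- 0 4
- 1 7
- dtype: int64
- >>> obj.sort_index()
- 0 4
- 1 7
- 2 -3
- 3 2
- dtype: int64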
To sort by the values in one or more columns, pass the column name or a list of names to the by option:
- >>> frame = DataFrame({'b':[4,7,3,8],'a':[0,1,0,1]})
- >>> frame
- a b
- 0 0 4
- 1 1 7
- 2 0 3
- 3 1 8
- >>> frame.sort_values(by='b')
- a b
- 2 0 3
- 0 0 4
- 1 1 7
- 3 1 8
- >>> frame.sort_values(by=['a','b'])
- a b
- 2 0 3
- 0 0 4
- 1 1 7
- 3 1 8
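With this data, sorting by ['a','b'] happens to give the same row order as sorting by 'b' alone, so the effect of the second key is not visible. A sketch that makes it visible passes one sort direction per column through ascending (the name mixed is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, 3, 8], 'a': [0, 1, 0, 1]})

# Sort by 'a' ascending, then by 'b' descending within each value of 'a'
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

row_order = list(mixed.index)
print(mixed)
```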
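Ranking is closely related to sorting: it assigns each value a rank, starting from 1. By default, rank breaks ties by giving every member of a tied group the mean of the ranks that group covers; the method option selects other tie-breaking rules: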
- >>> obj = Series([3,6,9,-2,-4,7,3,7])
- >>> obj.rank()
- 0 3.5
- 1 5.0
- 2 8.0
- 3 2.0
- 4 1.0
- 5 6.5
- 6 3.5
- 7 6.5
- dtype: float64
- >>> obj.rank(method='first')
- 0 3.0
- 1 5.0
- 2 8.0
- 3 2.0
- 4 1.0
- 5 6.0
- 6 4.0
- 7 7.0
- dtype: float64
- >>> obj.rank(ascending=False,method='max')
- 0 6.0
- 1 4.0
- 2 1.0
- 3 7.0
- 4 8.0
- 5 3.0
- 6 6.0
- 7 3.0
- dtype: float64
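The tie-breaking rules are easier to compare side by side. The sketch below collects the result of several method values for the same Series into one DataFrame; 'min' and 'dense' are additional pandas options that do not appear in the session above, and the name ranks is illustrative.

```python
import pandas as pd

obj = pd.Series([3, 6, 9, -2, -4, 7, 3, 7])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-breaking method, one row per element of obj
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The two 3s and the two 7s are the tied groups: 'average' gives them 3.5 and 6.5, 'min' and 'max' give the lowest and highest rank in the group, 'first' follows the order of appearance, and 'dense' is like 'min' but without gaps between groups.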
A DataFrame can compute ranks over the rows or over the columns:
- >>> frame = DataFrame({'b':[4.3,7,-3,2],'a':[0,1,0,1],'c':[-2,5,8,-2.5]})
- >>> frame
- a b c
- 0 0 4.3 -2.0
- 1 1 7.0 5.0
- 2 0 -3.0 8.0
- 3 1 2.0 -2.5
- >>> frame.rank(axis=1)
- a b c
- 0 2.0 3.0 1.0
- 1 1.0 3.0 2.0
- 2 2.0 1.0 3.0
- 3 2.0 3.0 1.0
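Axis labels do not have to be unique. Below is a Series with duplicate index values, together with the index's is_unique property, which tells you whether the labels are unique. Selecting a label that occurs several times returns a Series, while a label that occurs once returns a scalar: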
- >>> obj = Series(range(5),index=['a','a','b','b','c'])
- >>> obj
- a 0
- a 1
- b 2
- b 3
- c 4
- dtype: int64
- >>> obj.index.is_unique
- False
- >>> obj['a']
- a 0
- a 1
- dtype: int64
- >>> obj['c']
- 4
The same applies when indexing the rows of a DataFrame:
- >>> df = DataFrame(np.random.randn(4,3),index=['a','a','b','b'])
- >>> df
- 0 1 2
- a -1.619126 0.134523 0.906778
- a 0.748143 0.528331 0.470493
- b -0.480982 0.876438 -0.772287
- b -0.223553 0.002319 -0.850182
- >>> df.loc['b']
- 0 1 2
- b -0.480982 0.876438 -0.772287
- b -0.223553 0.002319 -0.850182
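Because the return type depends on whether a label is duplicated, code that indexes by label may want to check for duplicates first. A minimal sketch follows, with illustrative variable names; Index.duplicated is used to flag the repeated labels.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

repeated = obj['a']   # label occurs twice: a Series comes back
single = obj['c']     # label occurs once: a scalar comes back

repeated_is_series = isinstance(repeated, pd.Series)
single_is_scalar = bool(np.isscalar(single))

# True marks every occurrence of a label after its first one
duplicate_flags = obj.index.duplicated().tolist()

print(repeated_is_series, single_is_scalar, duplicate_flags)
```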