Basic Functionality of the pandas Library: Python for Data Analysis

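Contents

Reindexing

Dropping entries from an axis

Indexing, selection, and filtering

Filling values in arithmetic methods

Operations between DataFrame and Series

Function application and mapping

Sorting and ranking

Axis indexes with duplicate values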

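The content of this article comes from the book Python for Data Analysis and is collected here for convenient everyday reference. Please report any errors you find.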


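Reindexing

reindex creates a new object with the data conformed to a new index. If an index value is not already present, a missing value (NaN) is introduced for it; the fill_value option substitutes a value of your choice instead.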

    >>> from pandas import Series
    >>> obj = Series([4.5, 9.3, -8.4, 6.6], index=['d', 'b', 'a', 'c'])
    >>> obj
    d    4.5
    b    9.3
    a   -8.4
    c    6.6
    dtype: float64
    >>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
    >>> obj2
    a   -8.4
    b    9.3
    c    6.6
    d    4.5
    e    NaN
    dtype: float64
    >>> obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)
    >>> obj2
    a   -8.4
    b    9.3
    c    6.6
    d    4.5
    e    0.0
    dtype: float64

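The method option fills the gaps while reindexing; method='ffill' forward-fills, carrying the last known value forward: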

    >>> obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
    >>> obj3.reindex(range(6), method='ffill')
    0      blue
    1      blue
    2    purple
    3    purple
    4    yellow
    5    yellow
    dtype: object
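
The same method also works on a DataFrame, where it can change the row index, the columns, or both. The following is a minimal sketch rather than code from the original; the frame and the names by_rows and by_cols are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: rows labeled a, c, d and three columns.
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Reindex the rows: label 'b' has no data, so its row is filled with NaN.
by_rows = frame.reindex(['a', 'b', 'c', 'd'])

# Reindex the columns via the columns keyword: 'Utah' is new and becomes NaN,
# and 'Ohio' is dropped because it is absent from the new list.
by_cols = frame.reindex(columns=['Texas', 'Utah', 'California'])

print(by_rows)
print(by_cols)
```

Note that introducing NaN turns the affected integer columns into floats.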

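Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list. Because working out which entries remain can take a bit of data munging and set logic, the drop method does it for you and returns a new object with the indicated values deleted from the specified axis: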

    >>> from pandas import DataFrame, Series
    >>> import numpy as np
    >>> obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
    >>> new_obj = obj.drop('c')
    >>> new_obj
    a    0.0
    b    1.0
    d    3.0
    e    4.0
    dtype: float64
    >>> obj.drop(['d', 'c'])
    a    0.0
    b    1.0
    e    4.0
    dtype: float64
    >>> data = DataFrame(np.arange(16).reshape((4, 4)),
    ...                  index=['A', 'B', 'C', 'D'],
    ...                  columns=['one', 'two', 'three', 'four'])
    >>> data
       one  two  three  four
    A    0    1      2     3
    B    4    5      6     7
    C    8    9     10    11
    D   12   13     14    15
    >>> data.drop(['B', 'D'])
       one  two  three  four
    A    0    1      2     3
    C    8    9     10    11
    >>> data.drop(['one', 'three'], axis=1)
       two  four
    A    1     3
    B    5     7
    C    9    11
    D   13    15
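
Since drop returns a new object, the original is left untouched unless you assign the result back. A small sketch of that behavior, with illustrative variable names (trimmed, narrow) that do not appear in the original:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop builds a new Series; obj itself still holds all five entries.
trimmed = obj.drop(['d', 'c'])
print(len(obj), len(trimmed))

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['A', 'B', 'C', 'D'],
                    columns=['one', 'two', 'three', 'four'])

# axis='columns' is a more readable spelling of axis=1.
narrow = data.drop(['one', 'three'], axis='columns')
print(narrow.columns.tolist())
```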

Indexing, selection, and filtering

    >>> obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
    >>> obj['b']
    1.0
    >>> obj.iloc[1]  # by position; plain obj[1] on a labeled index is deprecated in recent pandas
    1.0
    >>> obj[2:4]
    c    2.0
    d    3.0
    dtype: float64
    >>> obj[['b', 'c']]
    b    1.0
    c    2.0
    dtype: float64
    >>> obj[obj > 2]
    d    3.0
    dtype: float64

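Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive. Indexing into a DataFrame with a column name, or a list of names, retrieves one or more columns: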

    >>> data = DataFrame(np.arange(16).reshape((4, 4)),
    ...                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
    ...                  columns=['one', 'two', 'three', 'four'])
    >>> data
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15
    >>> data['two']
    Ohio         1
    Colorado     5
    Utah         9
    New York    13
    Name: two, dtype: int32
    >>> data[['two', 'three']]
              two  three
    Ohio        1      2
    Colorado    5      6
    Utah        9     10
    New York   13     14
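
Returning to the inclusive endpoint of label slices mentioned above, the contrast with positional slicing can be sketched as follows. This is a minimal illustration, not code from the book; the names by_label and by_position are made up for the example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints 'b' and 'c' are included.
by_label = obj['b':'c']

# Positional slice: the usual Python rule applies, so the stop is excluded.
by_position = obj.iloc[1:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```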

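This style of indexing has a few special cases. The first is selecting rows, rather than columns, by slicing or with a boolean array: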

    >>> data[:2]
              one  two  three  four
    Ohio        0    1      2     3
    Colorado    4    5      6     7
    >>> data[data['three'] > 5]
              one  two  three  four
    Colorado    4    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15

You can also index with a boolean DataFrame, such as one produced by a scalar comparison:

    >>> data < 5
                one    two  three   four
    Ohio       True   True   True   True
    Colorado   True  False  False  False
    Utah      False  False  False  False
    New York  False  False  False  False
    >>> data[data < 5] = 0
    >>> data
              one  two  three  four
    Ohio        0    0      0     0
    Colorado    0    5      6     7
    Utah        8    9     10    11
    New York   12   13     14    15

For label-based indexing on a DataFrame, the book introduces the special indexing field ix, which selects a subset of rows and columns using NumPy-like notation plus axis labels. ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the examples below use its replacements: loc for labels and iloc for integer positions.

    >>> data = DataFrame(np.arange(16).reshape((4, 4)),
    ...                  index=['A', 'B', 'C', 'D'],
    ...                  columns=['one', 'two', 'three', 'four'])
    >>> data.loc['A', ['one', 'three']]
    one      0
    three    2
    Name: A, dtype: int32
    >>> data.loc[['A', 'C'], data.columns[[3, 0, 1]]]  # row labels, column positions
       four  one  two
    A     3    0    1
    C    11    8    9
    >>> data.iloc[2]
    one       8
    two       9
    three    10
    four     11
    Name: C, dtype: int32
    >>> data.loc[:'C', 'four']
    A     3
    B     7
    C    11
    Name: four, dtype: int32
    >>> data.loc[data.three > 5].iloc[:, :3]
       one  two  three
    B    4    5      6
    C    8    9     10
    D   12   13     14

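One of the most important pandas features is arithmetic between objects with different indexes. When objects are added and some index pairs differ, the index of the result is the union of the index pairs.

This automatic data alignment introduces NA values at the labels that do not overlap. Missing values then propagate through further arithmetic.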

    >>> s1 = Series([2.1, 3.1, 4.1, 5.1, 6.1], index=['a', 'b', 'c', 'd', 'e'])
    >>> s2 = Series([-2.3, 4.5, 5.6, 7.8], index=['a', 'c', 'd', 'e'])
    >>> s1, s2
    (a    2.1
    b    3.1
    c    4.1
    d    5.1
    e    6.1
    dtype: float64, a   -2.3
    c    4.5
    d    5.6
    e    7.8
    dtype: float64)
    >>> s1 + s2
    a    -0.2
    b     NaN
    c     8.6
    d    10.7
    e    13.9
    dtype: float64

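For a DataFrame, alignment happens on both the rows and the columns: adding two DataFrames returns a new DataFrame whose index and columns are the unions of those of the two originals.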

    >>> def1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
    ...                  index=['Ohio', 'Texas', 'Colorado'])
    >>> def2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
    ...                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    >>> def1 + def2
                b   c     d   e
    Colorado  NaN NaN   NaN NaN
    Ohio      3.0 NaN   6.0 NaN
    Oregon    NaN NaN   NaN NaN
    Texas     9.0 NaN  12.0 NaN
    Utah      NaN NaN   NaN NaN

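Filling values in arithmetic methods

In arithmetic between differently indexed objects, every position without an overlap produces NaN. The add() method accepts a fill_value to use in place of a value that is missing on one side, and reindex accepts a fill_value as well: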

    >>> df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
    >>> df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
    >>> df1
         a    b     c     d
    0  0.0  1.0   2.0   3.0
    1  4.0  5.0   6.0   7.0
    2  8.0  9.0  10.0  11.0
    >>> df2
          a     b     c     d     e
    0   0.0   1.0   2.0   3.0   4.0
    1   5.0   6.0   7.0   8.0   9.0
    2  10.0  11.0  12.0  13.0  14.0
    3  15.0  16.0  17.0  18.0  19.0
    >>> df1 + df2
          a     b     c     d   e
    0   0.0   2.0   4.0   6.0 NaN
    1   9.0  11.0  13.0  15.0 NaN
    2  18.0  20.0  22.0  24.0 NaN
    3   NaN   NaN   NaN   NaN NaN
    >>> df1.reindex(columns=df2.columns, fill_value=0)
         a    b     c     d  e
    0  0.0  1.0   2.0   3.0  0
    1  4.0  5.0   6.0   7.0  0
    2  8.0  9.0  10.0  11.0  0
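
The session above shows the plain + operator and reindex, but not the add method with a fill value that the text describes. A minimal sketch of that call, reusing the same two frames (the result name filled is made up for the example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Where only one of the two frames has a value, the missing side is
# treated as 0 instead of producing NaN.
filled = df1.add(df2, fill_value=0)
print(filled)
```

Here every cell exists in at least one of the two frames, so the result contains no NaN at all. A cell missing from both inputs would still come out as NaN.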

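Operations between DataFrame and Series

Arithmetic between a DataFrame and a Series is well defined. As a motivating example, consider the difference between a two-dimensional array and one of its rows.

The subtraction is performed once for every row. This is called broadcasting, and operations between a DataFrame and a Series work in a similar way.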

    >>> arr = np.arange(12.).reshape((3, 4))
    >>> arr
    array([[ 0.,  1.,  2.,  3.],
           [ 4.,  5.,  6.,  7.],
           [ 8.,  9., 10., 11.]])
    >>> arr[0]
    array([0., 1., 2., 3.])
    >>> arr - arr[0]
    array([[0., 0., 0., 0.],
           [4., 4., 4., 4.],
           [8., 8., 8., 8.]])

By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.

    >>> from pandas import DataFrame, Series
    >>> import numpy as np
    >>> frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
    ...                   index=['Utah', 'Ohio', 'Texas', 'OreGon'])
    >>> frame
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    OreGon  9.0  10.0  11.0
    >>> series = frame.iloc[0]  # first row by position (the removed ix accessor was used here)
    >>> series
    b    0.0
    d    1.0
    e    2.0
    Name: Utah, dtype: float64
    >>> frame - series
              b    d    e
    Utah    0.0  0.0  0.0
    Ohio    3.0  3.0  3.0
    Texas   6.0  6.0  6.0
    OreGon  9.0  9.0  9.0

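If an index value is found in only one of the two, that is, in the DataFrame's columns but not the Series's index or the other way around, both objects are reindexed to form the union.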

    >>> series2 = Series(range(3), index=['b', 'e', 'f'])
    >>> frame + series2
              b   d     e   f
    Utah    0.0 NaN   3.0 NaN
    Ohio    3.0 NaN   6.0 NaN
    Texas   6.0 NaN   9.0 NaN
    OreGon  9.0 NaN  12.0 NaN

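If you want to match on the rows and broadcast over the columns instead, you have to use one of the arithmetic methods. The axis number you pass is the axis to match on; in this example we match on the DataFrame's row index and broadcast across the columns.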

    >>> series3 = frame['d']
    >>> frame
              b     d     e
    Utah    0.0   1.0   2.0
    Ohio    3.0   4.0   5.0
    Texas   6.0   7.0   8.0
    OreGon  9.0  10.0  11.0
    >>> series3
    Utah       1.0
    Ohio       4.0
    Texas      7.0
    OreGon    10.0
    Name: d, dtype: float64
    >>> frame.sub(series3, axis=0)
              b    d    e
    Utah   -1.0  0.0  1.0
    Ohio   -1.0  0.0  1.0
    Texas  -1.0  0.0  1.0
    OreGon -1.0  0.0  1.0
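
One place where the two broadcasting directions show up in practice is centering data. The sketch below is an illustration under that assumption and is not from the original; col_centered and row_centered are made-up names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# frame.mean() is indexed by column name, so the default behavior
# (match on columns, broadcast down the rows) subtracts each column's mean.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so we must match on the row
# index (axis=0) to subtract each row's own mean.
row_centered = frame.sub(frame.mean(axis=1), axis=0)

print(col_centered)
print(row_centered)
```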

Function application and mapping

NumPy's element-wise array functions (ufuncs) also work on pandas objects:

    >>> frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
    ...                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
    >>> frame
                   b         d         e
    Utah    0.225392  0.944038 -0.286161
    Ohio   -0.075078 -1.416288 -1.681523
    Texas   1.674864  2.292591  0.433947
    Oregon  0.525176  1.926218 -0.891167
    >>> np.abs(frame)
                   b         d         e
    Utah    0.225392  0.944038  0.286161
    Ohio    0.075078  1.416288  1.681523
    Texas   1.674864  2.292591  0.433947
    Oregon  0.525176  1.926218  0.891167

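Another frequent operation is applying a function to the one-dimensional arrays formed by each column or each row. DataFrame's apply method does exactly this: by default the function receives each column in turn, and with axis=1 it receives each row.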

    >>> # frame is the same random DataFrame created in the previous example
    >>> f = lambda x: x.max() - x.min()
    >>> frame.apply(f)
    b    1.749942
    d    3.708879
    e    2.115470
    dtype: float64
    >>> frame.apply(f, axis=1)
    Utah      1.230200
    Ohio      1.606445
    Texas     1.858644
    Oregon    2.817385
    dtype: float64
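
The function passed to apply does not have to return a scalar, and the "mapping" half of this section's title refers to element-wise functions. Both can be sketched as below. This is an illustrative example rather than the original author's code: the seed, the helper min_max, and the names summary and formatted are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
frame = pd.DataFrame(rng.standard_normal((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(x):
    """Return several values per column, labeled 'min' and 'max'."""
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

# Each column yields a two-element Series, so the result is a DataFrame
# with rows 'min' and 'max' and the original columns.
summary = frame.apply(min_max)
print(summary)

# Series.map applies a Python function to every element individually.
formatted = frame['e'].map(lambda v: f'{v:.2f}')
print(formatted)
```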

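Sorting and ranking

Sorting a dataset by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object.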

    >>> from pandas import DataFrame, Series
    >>> import numpy as np
    >>> obj = Series(range(4), index=['d', 'a', 'b', 'c'])
    >>> obj.sort_index()
    a    1
    b    2
    c    3
    d    0
    dtype: int64

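With a DataFrame, you can sort by the index on either axis.

Data is sorted in ascending order by default, but it can be sorted in descending order too by passing ascending=False.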

    >>> frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
    ...                   columns=['d', 'a', 'b', 'c'])
    >>> frame.sort_index()
           d  a  b  c
    one    4  5  6  7
    three  0  1  2  3
    >>> frame.sort_index(axis=1)
           a  b  c  d
    three  1  2  3  0
    one    5  6  7  4
    >>> frame.sort_index(axis=0)
           d  a  b  c
    one    4  5  6  7
    three  0  1  2  3
    >>> frame.sort_index(axis=1, ascending=False)
           d  c  b  a
    three  0  3  2  1
    one    4  7  6  5

To sort a Series by its values, use sort_values. The order method used in older editions of the book has since been removed from pandas.

    >>> obj = Series([4, 7, -3, 2])
    >>> obj.sort_values()
    2   -3
    3    2
    0    4
    1    7
    dtype: int64
    >>> obj.sort_index()
    0    4
    1    7
    2   -3
    3    2
    dtype: int64

To sort a DataFrame by the values in one or more columns, pass a column name, or a list of names, to the by option:

    >>> # keys listed in a, b order so the column order matches on any pandas version
    >>> frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, 3, 8]})
    >>> frame
       a  b
    0  0  4
    1  1  7
    2  0  3
    3  1  8
    >>> frame.sort_values(by='b')
       a  b
    2  0  3
    0  0  4
    1  1  7
    3  1  8
    >>> frame.sort_values(by=['a', 'b'])
       a  b
    2  0  3
    0  0  4
    1  1  7
    3  1  8
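
Two details the examples above do not show are where missing values end up and how to mix sort directions across columns. The following sketch illustrates both; it is an added example with made-up data, not part of the original session.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Missing values are placed at the end by default, in either direction.
ascending = obj.sort_values()
descending = obj.sort_values(ascending=False)
print(ascending)
print(descending)

frame = pd.DataFrame({'a': [0, 1, 0, 1], 'b': [4, 7, 3, 8]})

# One flag per column in by: 'a' ascending, then 'b' descending within ties.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])
print(mixed)
```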

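Ranking is closely related to sorting. It assigns ranks from 1 up to the number of valid data points, and by default rank breaks ties by giving every member of a tied group the mean of the ranks that group occupies. The method option selects a different tie-breaking rule.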

    >>> obj = Series([3, 6, 9, -2, -4, 7, 3, 7])
    >>> obj.rank()
    0    3.5
    1    5.0
    2    8.0
    3    2.0
    4    1.0
    5    6.5
    6    3.5
    7    6.5
    dtype: float64
    >>> obj.rank(method='first')
    0    3.0
    1    5.0
    2    8.0
    3    2.0
    4    1.0
    5    6.0
    6    4.0
    7    7.0
    dtype: float64
    >>> obj.rank(ascending=False, method='max')
    0    6.0
    1    4.0
    2    1.0
    3    7.0
    4    8.0
    5    3.0
    6    6.0
    7    3.0
    dtype: float64
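
Besides 'first' and 'max', rank also accepts the tie-breaking methods 'min' and 'dense'. A short sketch on the same data, added here for comparison (the names min_rank and dense_rank are made up for the example):

```python
import pandas as pd

obj = pd.Series([3, 6, 9, -2, -4, 7, 3, 7])

# 'min' gives every member of a tied group the lowest rank in that group,
# so the rank after a two-way tie skips one number.
min_rank = obj.rank(method='min')

# 'dense' is like 'min', but ranks always increase by exactly 1 between groups.
dense_rank = obj.rank(method='dense')

print(pd.DataFrame({'value': obj, 'min': min_rank, 'dense': dense_rank}))
```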

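A DataFrame can compute ranks over the rows or over the columns; with axis=1 the values in each row are ranked against one another: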

    >>> # keys listed in a, b, c order so the column order matches on any pandas version
    >>> frame = DataFrame({'a': [0, 1, 0, 1], 'b': [4.3, 7, -3, 2],
    ...                    'c': [-2, 5, 8, -2.5]})
    >>> frame
       a    b    c
    0  0  4.3 -2.0
    1  1  7.0  5.0
    2  0 -3.0  8.0
    3  1  2.0 -2.5
    >>> frame.rank(axis=1)
         a    b    c
    0  2.0  3.0  1.0
    1  1.0  3.0  2.0
    2  2.0  1.0  3.0
    3  2.0  3.0  1.0

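Axis indexes with duplicate values

Below is a Series whose index contains duplicate labels, together with the index's is_unique property, which tells you whether the labels are unique. Selecting a label that occurs several times returns a Series, while a label that occurs once returns a scalar: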

    >>> obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
    >>> obj
    a    0
    a    1
    b    2
    b    3
    c    4
    dtype: int64
    >>> obj.index.is_unique
    False
    >>> obj['a']
    a    0
    a    1
    dtype: int64
    >>> obj['c']
    4

The same logic applies when indexing the rows of a DataFrame:

    >>> df = DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
    >>> df
              0         1         2
    a -1.619126  0.134523  0.906778
    a  0.748143  0.528331  0.470493
    b -0.480982  0.876438 -0.772287
    b -0.223553  0.002319 -0.850182
    >>> df.loc['b']  # label-based row selection (the removed ix accessor was used here)
              0         1         2
    b -0.480982  0.876438 -0.772287
    b -0.223553  0.002319 -0.850182
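
Because the return type depends on whether a label repeats, code that indexes such objects sometimes needs to detect or remove the repeats first. One way to do that is sketched below; it is an added illustration with made-up variable names, not something the original article covers.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# duplicated() flags every occurrence of a label after its first one.
dup_mask = obj.index.duplicated()
print(dup_mask)

# Inverting the mask keeps only the first entry for each label.
first_only = obj[~dup_mask]
print(first_only)

repeated = obj['a']  # label occurs twice: the result is a Series
single = obj['c']    # label occurs once: the result is a scalar
print(type(repeated).__name__, single)
```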
