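A Series consists of two interrelated arrays (values, index): the former (also called the main array) stores the data, and the latter stores the label associated with each element of values.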
- import numpy as np
- import pandas as pd
-
- s1 = pd.Series([1, 3, 5, 7])
-
- print(s1)
- →0 1
- 1 3
- 2 5
- 3 7
- dtype: int64
-
- print(s1.values)
- →[1 3 5 7]
-
- print(s1.index)
- →RangeIndex(start=0, stop=4, step=1)
Creating a Series from a NumPy array (or from an existing Series):
- arr1 = np.array([1, 3, 5, 7])
- s2 = pd.Series(arr1, index=['a', 'b', 'c', 'd'])
- s2_ = pd.Series(s2)
-
- print(s2)
- →a 1
- b 3
- c 5
- d 7
- dtype: int32
-
- print(s2_)
- →a 1
- b 3
- c 5
- d 7
- dtype: int32
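When a Series is built from a dictionary together with an index array, each index value that matches a dictionary key takes that key's value; index values with no matching key get NaN.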
- dict1 = {"a": 3, "b": 4, "c": 5}
- s3 = pd.Series(dict1, index=["a", "b", "c", "d"])
-
- print(s3)
- →a 3.0
- b 4.0
- c 5.0
- d NaN
- dtype: float64
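A DataFrame extends the Series to multiple dimensions. It consists of an ordered collection of columns (which may have different types) and has two index arrays (index, columns).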
- dict2 = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]}
- df1 = pd.DataFrame(dict2)
-
- print(df1)
- → a b c
- 0 1 5 9
- 1 2 6 10
- 2 3 7 11
- 3 4 8 12
-
- df2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
- index=["one", "two", "three", "four"],
- columns=["ball", "pen", "pencil", "paper"])
-
- print(df2)
- → ball pen pencil paper
- one 0 1 2 3
- two 4 5 6 7
- three 8 9 10 11
- four 12 13 14 15
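Both DataFrames above hold only integers, so the point about columns of different types is easy to miss. The following is a minimal sketch with made-up data (the name `df_mixed` and its column names are hypothetical) showing that each column keeps its own dtype and that the two index arrays can be read separately.

```python
import pandas as pd

# Hypothetical example data: each column holds a different type.
df_mixed = pd.DataFrame({
    "name": ["ball", "pen", "pencil"],
    "price": [1.5, 0.8, 0.3],
    "in_stock": [True, False, True],
})

# Each column keeps its own dtype.
print(df_mixed.dtypes)

# The two index arrays: row labels and column labels.
print(list(df_mixed.index))
print(list(df_mixed.columns))
```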
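Function prototype: read_csv(filepath, sep, names, encoding). The parameters are, in order: the path of the CSV file to read, the delimiter (a comma by default), the list of column names to assign (by default the first row of the file is used as the header), and the file encoding (usually utf-8).
read_table() takes the same parameters as read_csv(), but its default delimiter is a tab. Because the delimiter used in a txt file is not fixed, sep must be set to match the file exactly.
read_excel() accepts many parameters, but the commonly used ones are the file path, the sheet to read (sheet_name), and the columns to read (usecols); usually only the first is needed.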
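Examples follow. They read three files, data.csv, data.txt, and data.xlsx; the screenshots of the file contents are not available, but the printed output of each call shows what was loaded.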
- df3 = pd.read_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.csv")
-
- print(df3)
- → 0 1 2
- 0 1 2 3
- 1 4 5 6
- 2 7 8 9
- 3 10 11 12
-
- df4 = pd.read_table(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.txt", sep=' ', header=None)
-
- print(df4)
- → 0 1
- 0 1 2
- 1 3 4
- 2 5 6
- 3 7 8
- 4 9 10
-
- df5 = pd.read_excel(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.xlsx")
-
- print(df5)
- → 0 1 2 3
- 0 a b c d
- 1 e f g h
- 2 i j k l
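The calls above rely on local files and do not exercise sep or names. As a rough sketch of how those two parameters behave, the snippet below reads from in-memory text instead of a file; the strings `raw` and `raw_tab` and the column names `x`, `y`, `z` are invented for illustration. encoding is left out because the text is already decoded.

```python
import io
import pandas as pd

# Hypothetical in-memory file contents: no header row, ';' as the delimiter.
raw = "1;2;3\n4;5;6\n"

# sep sets the delimiter; names supplies the column labels,
# so the first line is treated as data rather than as a header.
df_named = pd.read_csv(io.StringIO(raw), sep=";", names=["x", "y", "z"])
print(df_named)

# read_table is the same reader with a tab as its default delimiter.
raw_tab = "1\t2\n3\t4\n"
df_tab = pd.read_table(io.StringIO(raw_tab), header=None)
print(df_tab)
```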
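Function prototype: to_csv(filepath, sep, index, encoding). The parameters are, in order: the path of the CSV file to write, the delimiter (a comma by default), whether to write the row index (True by default, meaning the index is written), and the file encoding (usually utf-8).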
- df3.to_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data1.csv", index=True, header=True)
- df3.to_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data2.csv", index=False, header=True)
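data1.csv contains the row index as its first column, followed by the data columns.
data2.csv contains only the header row and the data columns, with no index column.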
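To see the effect of the index parameter without writing any files, one option is to call to_csv with no path, in which case it returns the CSV text as a string. The sketch below uses a small made-up DataFrame (`df_small` is a hypothetical name, not one of the files above).

```python
import pandas as pd

df_small = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical example data

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df_small.to_csv(index=True)
without_index = df_small.to_csv(index=False)

print(with_index)      # first column is the row index, under an empty header cell
print(without_index)   # header row and data only
```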
- print(s1[2])
- →5
-
- print(s2['c'])
- →5
-
- print(s2[0:2])
- →a 1
- b 3
- dtype: int32
-
- print(s2[['a', 'b']])
- →a 1
- b 3
- dtype: int32
- print(df2.columns)
- →Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')
-
- print(type(df2.columns))
- →<class 'pandas.core.indexes.base.Index'>
-
- print(df2.index)
- →Index(['one', 'two', 'three', 'four'], dtype='object')
-
- print(type(df2.index))
- →<class 'pandas.core.indexes.base.Index'>
-
- print(df2.values)
- →[[ 0 1 2 3]
- [ 4 5 6 7]
- [ 8 9 10 11]
- [12 13 14 15]]
-
- print(type(df2.values))
- →<class 'numpy.ndarray'>
-
- print(df2["pencil"])
- →one 2
- two 6
- three 10
- four 14
- Name: pencil, dtype: int32
-
- print(df2.pen)
- →one 1
- two 5
- three 9
- four 13
- Name: pen, dtype: int32
-
- print(df2[0:2])
- → ball pen pencil paper
- one 0 1 2 3
- two 4 5 6 7
Create a Series object as follows:
- s4 = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
- s4['e'] = 9
- print(s4)
- →a 1
- b 3
- c 5
- d 7
- e 9
- dtype: int64
- s4.pop('e')
- print(s4)
- →a 1
- b 3
- c 5
- d 7
- dtype: int64
-
- print(s4.drop('c'))
- →a 1
- b 3
- d 7
- dtype: int64
-
- print(s4)
- →a 1
- b 3
- c 5
- d 7
- dtype: int64
- s4.iloc[2] = 6
- s4['a'] = 0
- print(s4)
- →a 0
- b 3
- c 6
- d 7
- dtype: int64
-
- print(s4[s4 > 4])
- →c 6
- d 7
- dtype: int64
-
- df2.loc["two", "pencil"] = 12
- print(df2)
- → ball pen pencil paper
- one 0 1 2 3
- two 4 5 12 7
- three 8 9 10 11
- four 12 13 14 15
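When a Series or DataFrame has string labels, it can be ambiguous whether an integer in square brackets means a position or a label, and recent pandas versions warn about chained assignment such as `df["col"][i] = value`. A minimal sketch of the explicit alternatives, .loc (by label) and .iloc (by position), is given below; the variable names `s` and `df` are local to this example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, 7], index=["a", "b", "c", "d"])
s.iloc[2] = 6      # assign by position (third element)
s.loc["a"] = 0     # assign by label

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["one", "two", "three", "four"],
                  columns=["ball", "pen", "pencil", "paper"])
df.loc["two", "pencil"] = 12   # row label, column label
df.iloc[0, 3] = 30             # row position, column position

print(s)
print(df)
```

Both forms modify the object in a single indexing step, so the assignment does not depend on whether an intermediate selection is a view or a copy.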
Create a DataFrame object as follows:
- arr2 = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(4, 2)
- df6 = pd.DataFrame(arr2, index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
- print(df6)
- → one two
- a 1 2
- b 3 4
- c 5 6
- d 7 8
- print(df6.sum())
- →one 16
- two 20
- dtype: int64
-
- print(df6.sum(axis=1))
- →a 3
- b 7
- c 11
- d 15
- dtype: int64
- print(df6.cumsum())
- → one two
- a 1 2
- b 4 6
- c 9 12
- d 16 20
- print(df6.idxmax())
- →one d
- two d
- dtype: object
-
- print(df6.idxmin())
- →one a
- two a
- dtype: object
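unique() returns a NumPy array of the distinct elements; value_counts() returns a Series whose index holds the distinct elements and whose values hold how often each one occurs.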
- s5 = pd.Series([1, 3, 5, 7, 2, 4, 3, 5, 7, 6, 7])
-
- print(s5.unique())
- →[1 3 5 7 2 4 6]
-
- print(type(s5.unique()))
- →<class 'numpy.ndarray'>
-
- print(s5.value_counts())
- →7 3
- 3 2
- 5 2
- 1 1
- 2 1
- 4 1
- 6 1
- dtype: int64
-
- print(type(s5.value_counts()))
- →<class 'pandas.core.series.Series'>
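isin() tests, element by element, whether each value of the Series is contained in the given collection, and returns a boolean Series that can be used as a filter.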
- print(s5.isin([2, 4]))
- →0 False
- 1 False
- 2 False
- 3 False
- 4 True
- 5 True
- 6 False
- 7 False
- 8 False
- 9 False
- 10 False
- dtype: bool
-
- print(s5[s5.isin([2, 4])])
- →4 2
- 5 4
- dtype: int64
- s6 = pd.Series([20, 40, 60, 80])
-
- print(s6 / 2)
- →0 10.0
- 1 20.0
- 2 30.0
- 3 40.0
- dtype: float64
-
- print(np.log(s6))
- →0 2.995732
- 1 3.688879
- 2 4.094345
- 3 4.382027
- dtype: float64
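Index alignment is an important part of data cleaning. Arithmetic between two Series is aligned by index label rather than by position, and any label that does not appear in both operands, including one that exists only at the end of the longer Series, yields NaN in the result.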
- s7 = pd.Series({"b": 4, "c": 5, "a": 3})
- s8 = pd.Series({"a": 1, "b": 7, "c": 2, "d": 11})
-
- print(s7 + s8)
- →a 4.0
- b 11.0
- c 7.0
- d NaN
- dtype: float64
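If NaN for unmatched labels is not what is wanted, one possible approach is the Series.add method with its fill_value argument, which substitutes a default for a label missing from one side before adding. This is a sketch of that alternative, not something the example above does; the name `total` is introduced here for illustration.

```python
import pandas as pd

s7 = pd.Series({"b": 4, "c": 5, "a": 3})
s8 = pd.Series({"a": 1, "b": 7, "c": 2, "d": 11})

# Labels are still aligned first; "d" is missing from s7,
# so it is treated as 0 there instead of producing NaN.
total = s7.add(s8, fill_value=0)
print(total)
```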