pandas Basics
Introduction to pandas
Python Data Analysis Library
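pandas is a tool built on NumPy that was created for data analysis tasks. It brings together a large number of libraries and several standard data models, and provides the tools needed to work efficiently with large structured datasets.
Core pandas Data Structures
A data structure is the way a computer stores and organizes data. A carefully chosen data structure usually gives better running time or storage efficiency, and data structures are often closely tied to efficient retrieval algorithms and indexing techniques.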
Series
A Series can be thought of as a one-dimensional array whose index labels can be customized. It resembles a fixed-length ordered dictionary, with an index and values.

    """pandas Series objects"""
    import pandas as pd
    import numpy as np

    # Empty Series (dtype is passed explicitly because the default dtype
    # of an empty Series differs between pandas versions)
    s1 = pd.Series(dtype='float64')
    print(s1)  # Series([], dtype: float64)

    # Create a Series from an array
    data = np.array(['zs', 'ls', 'ww', 'zl'])
    s2 = pd.Series(data)
    print(s2)
    """
    0    zs
    1    ls
    2    ww
    3    zl
    dtype: object
    """

    # Set custom index labels
    s3 = pd.Series(data, index=['s001', 's002', 's003', 's004'])
    print(s3)
    """
    s001    zs
    s002    ls
    s003    ww
    s004    zl
    dtype: object
    """

    # Create a Series from a dictionary
    data = {'s01': 'zs', 's02': 'li', 's03': 'ww', 's04': 'zl'}
    s4 = pd.Series(data)
    print(s4)
    """
    s01    zs
    s02    li
    s03    ww
    s04    zl
    dtype: object
    """

    # Create a Series from a scalar
    s5 = pd.Series(5, index=['a', 'b', 'c'])
    print(s5)
    """
    a    5
    b    5
    c    5
    dtype: int64
    """
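    # Reading data from a Series (s3 is the Series created above)
    print(s3)
    """
    s001    zs
    s002    ls
    s003    ww
    s004    zl
    dtype: object
    """
    # Access by position; s3[0] is deprecated on a label-indexed Series
    print(s3.iloc[0])  # zs
    # Access by positional slice
    print(s3.iloc[:2])
    """
    s001    zs
    s002    ls
    dtype: object
    """
    # Access by index label
    print(s3['s003'])  # ww
    # Access by a list of index labels
    print(s3[['s001', 's003']])
    """
    s001    zs
    s003    ww
    dtype: object
    """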
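Label-based and position-based access are easy to confuse, so here is a minimal sketch that contrasts the two on the same data as `s3`. The variable names are illustrative only; `.loc` and `.iloc` are the standard pandas indexers.

```python
import pandas as pd

# Same data and labels as the s3 example
s = pd.Series(['zs', 'ls', 'ww', 'zl'],
              index=['s001', 's002', 's003', 's004'])

first_by_position = s.iloc[0]          # element at position 0
third_by_label = s.loc['s003']         # element with label 's003'
label_slice = s.loc['s001':'s003']     # label slices include the end label
position_slice = s.iloc[0:2]           # position slices exclude the end

print(first_by_position)    # zs
print(third_by_label)       # ww
print(len(label_slice))     # 3
print(len(position_slice))  # 2
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index labels are themselves integers.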
Date Handling in pandas

    import pandas as pd

    # Date string formats that pandas recognizes
    s6 = pd.Series(['2011', '2011-01', '2011-01-02', '2012/02/01',
                    '2011-01-02 08:00:00', '01 Jun 2012'])

    # to_datetime() converts the values to a datetime type.
    # format='mixed' (pandas 2.0+) parses each string individually;
    # on older versions, omit the format argument.
    s6 = pd.to_datetime(s6, format='mixed')
    print(s6)
    """
    0   2011-01-01 00:00:00
    1   2011-01-01 00:00:00
    2   2011-01-02 00:00:00
    3   2012-02-01 00:00:00
    4   2011-01-02 08:00:00
    5   2012-06-01 00:00:00
    dtype: datetime64[ns]
    """

    # datetime data supports date arithmetic
    delta = s6 - pd.to_datetime('2011-01-01')
    print(delta)
    """
    0     0 days 00:00:00
    1     0 days 00:00:00
    2     1 days 00:00:00
    3   396 days 00:00:00
    4     1 days 08:00:00
    5   517 days 00:00:00
    dtype: timedelta64[ns]
    """

    # Extract one field of the dates in s6
    # (the integer dtype shown may be int32 on recent pandas versions)
    print(s6.dt.quarter)
    """
    0    1
    1    1
    2    1
    3    1
    4    1
    5    2
    dtype: int64
    """

    # Get the offset in whole days
    print(delta.dt.days)
    """
    0      0
    1      0
    2      1
    3    396
    4      1
    5    517
    dtype: int64
    """

    print(s6.dt.month)
    """
    0    1
    1    1
    2    1
    3    2
    4    1
    5    6
    dtype: int64
    """
Series.dt provides many date-related operations, including the following:

Attribute | Description |
---|---|
Series.dt.year | The year of the datetime. |
Series.dt.month | The month as January=1, December=12. |
Series.dt.day | The day of the datetime. |
Series.dt.hour | The hours of the datetime. |
Series.dt.minute | The minutes of the datetime. |
Series.dt.second | The seconds of the datetime. |
Series.dt.microsecond | The microseconds of the datetime. |
Series.dt.week | The week ordinal of the year. Deprecated in pandas 1.1 and removed in 2.0; use Series.dt.isocalendar().week. |
Series.dt.weekofyear | The week ordinal of the year. Deprecated in pandas 1.1 and removed in 2.0; use Series.dt.isocalendar().week. |
Series.dt.dayofweek | The day of the week with Monday=0, Sunday=6. |
Series.dt.weekday | The day of the week with Monday=0, Sunday=6. |
Series.dt.dayofyear | The ordinal day of the year. |
Series.dt.quarter | The quarter of the date. |
Series.dt.is_month_start | Indicates whether the date is the first day of the month. |
Series.dt.is_month_end | Indicates whether the date is the last day of the month. |
Series.dt.is_quarter_start | Indicates whether the date is the first day of a quarter. |
Series.dt.is_quarter_end | Indicates whether the date is the last day of a quarter. |
Series.dt.is_year_start | Indicates whether the date is the first day of a year. |
Series.dt.is_year_end | Indicates whether the date is the last day of the year. |
Series.dt.is_leap_year | Boolean indicator of whether the date belongs to a leap year. |
Series.dt.days_in_month | The number of days in the month. |
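A few of these accessors can be tried on a small Series. The sketch below uses made-up dates and assumes pandas 1.1 or later, where `Series.dt.isocalendar()` is available as the way to get the ISO week number.

```python
import pandas as pd

# Three illustrative dates (not from the examples above)
dates = pd.Series(pd.to_datetime(['2011-01-02', '2012-02-29', '2012-06-01']))

iso_week = dates.dt.isocalendar().week   # ISO week number of the year
weekday = dates.dt.dayofweek             # Monday=0 ... Sunday=6
leap = dates.dt.is_leap_year             # True for dates in a leap year
month_days = dates.dt.days_in_month      # length of each date's month

print(iso_week.tolist())    # [52, 9, 22]
print(weekday.tolist())     # [6, 2, 4]
print(leap.tolist())        # [False, True, True]
print(month_days.tolist())  # [31, 29, 30]
```

Note that 2011-01-02 is a Sunday and therefore belongs to ISO week 52 of the previous ISO year.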
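DatetimeIndex
A date sequence can be created with the pd.date_range() function by specifying the number of periods and a frequency. By default, the frequency is one day.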
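As a rough sketch of how the start, end, and periods arguments relate, the example below builds the same daily range in two ways and also creates a business-day range with `pd.bdate_range()`. The variable names are illustrative.

```python
import pandas as pd

# A daily range given as start + number of periods ...
by_periods = pd.date_range('2019-10-01', periods=7, freq='D')
# ... and the same range given as start + end (both endpoints included)
by_end = pd.date_range('2019-10-01', '2019-10-07', freq='D')
print(by_periods.equals(by_end))  # True

# Business days only: weekends are skipped
business_days = pd.bdate_range('2019-10-01', periods=7)
print(business_days[-1].date())               # 2019-10-09
print(bool((business_days.dayofweek < 5).all()))  # True
```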
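    from datetime import datetime

    import pandas as pd

    # Daily frequency (the default)
    datelist = pd.date_range('2019/08/21', periods=5)
    print(datelist)

    # Month-end frequency (pandas 2.2+ prefers the alias 'ME' over 'M')
    datelist = pd.date_range('2019/08/21', periods=5, freq='M')
    print(datelist)

    # Build a time series over an interval.
    # pd.datetime was removed from pandas; use the standard library datetime.
    start = datetime(2017, 11, 1)
    end = datetime(2017, 11, 5)
    dates = pd.date_range(start, end)
    print(dates)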
    import pandas as pd

    datelist = pd.date_range('2011/11/03', periods=5)
    print(datelist)

    """DatetimeIndex"""
    import pandas as pd

    # Daily frequency
    d = pd.date_range('2019-01-01', periods=7)
    print(d)
    """
    DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
                   '2019-01-05', '2019-01-06', '2019-01-07'],
                  dtype='datetime64[ns]', freq='D')
    """
    print(d.dtype)  # datetime64[ns]
    print(type(d))  # <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

    # Generate a range of dates; freq defaults to 'D' (daily)
    d = pd.date_range('2019-10-01', periods=7)
    print(d)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-05', '2019-10-06', '2019-10-07'],
                  dtype='datetime64[ns]', freq='D')
    """

    # Generate a range of dates with freq='M' (month end);
    # pandas 2.2+ prefers the alias 'ME' and displays it in the output
    d2 = pd.date_range('2019-10-01', periods=5, freq='M')
    print(d2)
    """
    DatetimeIndex(['2019-10-31', '2019-11-30', '2019-12-31', '2020-01-31',
                   '2020-02-29'],
                  dtype='datetime64[ns]', freq='M')
    """

    # Generate a range of dates over the closed interval [start, end]
    d3 = pd.date_range('2019-10-1', '2019-10-7')
    print(d3)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-05', '2019-10-06', '2019-10-07'],
                  dtype='datetime64[ns]', freq='D')
    """

    # Generate a range of dates that contains business days only
    d4 = pd.bdate_range('2019-10-1', periods=7)
    print(d4)
    """
    DatetimeIndex(['2019-10-01', '2019-10-02', '2019-10-03', '2019-10-04',
                   '2019-10-07', '2019-10-08', '2019-10-09'],
                  dtype='datetime64[ns]', freq='B')
    """
DataFrame
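A DataFrame is a table-like data type that can be thought of as a two-dimensional array. It is indexed along two dimensions (rows and columns), and both indexes can be changed. A DataFrame has the following characteristics:
- Columns can potentially be of different types
- Size is mutable
- Labeled axes (rows and columns)
- Arithmetic operations can be performed on rows and columns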
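The characteristics listed above can be illustrated with a small sketch. The data, the `Height` column, and the `AgeNextYear` column are invented for illustration.

```python
import pandas as pd

# Labeled rows and columns, with a different type in each column
df = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke'],
                   'Age': [10, 12, 13],
                   'Height': [1.40, 1.52, 1.58]},
                  index=['r1', 'r2', 'r3'])
print(df.dtypes)  # one dtype per column

# Size is mutable: a new column is added by arithmetic on an existing one
df['AgeNextYear'] = df['Age'] + 1

# Arithmetic along rows (axis=1) and along columns (axis=0)
numeric = df[['Age', 'AgeNextYear']]
row_totals = numeric.sum(axis=1)
col_totals = numeric.sum(axis=0)
print(row_totals.tolist())  # [21, 25, 27]
print(col_totals.tolist())  # [35, 38]
```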
    import pandas as pd

    # Create an empty DataFrame
    df = pd.DataFrame()
    print(df)
    """
    Empty DataFrame
    Columns: []
    Index: []
    """

    # Create a DataFrame from a list
    data = ['Tom', 'Jerry', 'Dog', 'Lily']
    df = pd.DataFrame(data)
    print(df)
    """
           0
    0    Tom
    1  Jerry
    2    Dog
    3   Lily
    """

    # Create a DataFrame from a two-dimensional list.
    # columns=['Name', 'Age'] sets the column labels; without it they start at 0.
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)
    """
         Name  Age
    0    Alex   10
    1     Bob   12
    2  Clarke   13
    """

    # Convert Age to float after construction. Passing dtype=float to the
    # constructor can raise in recent pandas versions, because the Name
    # column holds strings that cannot be converted.
    data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
    df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
    print(df)
    """
         Name   Age
    0    Alex  10.0
    1     Bob  12.0
    2  Clarke  13.0
    """

    # Create a DataFrame from a list of dictionaries
    data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
    df = pd.DataFrame(data)
    print(df)
    """
       a   b     c
    0  1   2   NaN
    1  5  10  20.0
    """

    # Create a DataFrame from a dictionary of lists
    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    print(df)
    """
         Name  Age
    s1    Tom   28
    s2   Jack   34
    s3  Steve   29
    s4  Ricky   42
    """

    # Create a DataFrame from a dictionary of Series; rows are aligned on the index
    data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
            'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(data)
    print(df)
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """
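Operations on the Core Data Structures
Column Access
A single column of a DataFrame is a Series. By definition, a DataFrame is a labeled two-dimensional array, and each label corresponds to the name of a column.

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    # A single column label returns a Series
    print(df['one'])
    """
    a    1.0
    b    2.0
    c    3.0
    d    NaN
    Name: one, dtype: float64
    """

    # A list of column labels returns a DataFrame
    print(df[['one', 'two']])
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """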
Adding a Column
Adding a column to a DataFrame is simple: create a new column label and assign data to it.

    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])

    # Access the Name column
    print(df['Name'], type(df['Name']))
    """
    s1      Tom
    s2     Jack
    s3    Steve
    s4    Ricky
    Name: Name, dtype: object <class 'pandas.core.series.Series'>
    """

    # Add a score column
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    print(df)
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """
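One detail worth seeing in isolation: when the assigned value is a Series, pandas aligns it with the DataFrame on index labels rather than on position. The sketch below uses invented scores and a hypothetical `passed` column to show this.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve']},
                  index=['s1', 's2', 's3'])

# The Series is given out of order and has no entry for 's2';
# values are matched by label, and the missing label becomes NaN
df['score'] = pd.Series([70, 90], index=['s3', 's1'])

# A new column can also be computed from an existing one
# (a comparison with NaN evaluates to False)
df['passed'] = df['score'] >= 80

print(df)
```

Assigning a plain list instead would match by position and would require the list length to equal the number of rows.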
Deleting a Column
A column can be deleted with the del statement or with the pop method that pandas provides. Both are used as follows:

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
         'three': pd.Series([10, 20, 30], index=['a', 'b', 'c'])}
    df = pd.DataFrame(d)
    print("dataframe is:")
    print(df)
    """
    dataframe is:
       one  two  three
    a  1.0    1   10.0
    b  2.0    2   20.0
    c  3.0    3   30.0
    d  NaN    4    NaN
    """

    # Delete the column 'one' with del
    del df['one']
    print(df)
    """
       two  three
    a    1   10.0
    b    2   20.0
    c    3   30.0
    d    4    NaN
    """

    # Delete the column 'two' with pop (pop also returns the removed column)
    df.pop('two')
    print(df)
    """
       three
    a   10.0
    b   20.0
    c   30.0
    d    NaN
    """
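Both `del` and `pop` modify the DataFrame in place. As a small sketch with invented data, the example below contrasts `pop` with `DataFrame.drop(columns=...)`, which returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3],
                   'two': [4, 5, 6],
                   'three': [7, 8, 9]})

# pop removes the column in place and returns it as a Series
popped = df.pop('two')
print(popped.tolist())        # [4, 5, 6]
print(list(df.columns))       # ['one', 'three']

# drop returns a new DataFrame; df itself still has both columns
without_one = df.drop(columns=['one'])
print(list(without_one.columns))  # ['three']
print(list(df.columns))           # ['one', 'three']
```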
Row Access
To access a range of rows in a DataFrame, use array-style slicing with ":":

    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    # Rows at positions 2 and 3
    print(df[2:4])
    """
       one  two
    c  3.0    3
    d  NaN    4
    """
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)

    # Access by index label with loc
    print(df.loc['b'])
    """
    one    2.0
    two    2.0
    Name: b, dtype: float64
    """
    print(df.loc[['a', 'b']])
    """
       one  two
    a  1.0    1
    b  2.0    2
    """
    import pandas as pd

    d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
         'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
    df = pd.DataFrame(d)
    print(df)
    """
       one  two
    a  1.0    1
    b  2.0    2
    c  3.0    3
    d  NaN    4
    """

    # Access by integer position with iloc
    print(df.iloc[2])
    """
    one    3.0
    two    3.0
    Name: c, dtype: float64
    """
    print(df.iloc[[2, 3]])
    """
       one  two
    c  3.0    3
    d  NaN    4
    """
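Both indexers also accept a row selector and a column selector together. The following is a minimal sketch on the same `df` as the examples above; the variable names are illustrative.

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# A single cell, by labels and by positions
cell_by_label = df.loc['b', 'two']
cell_by_position = df.iloc[1, 1]
print(cell_by_label, cell_by_position)  # 2 2

# A block of rows and columns; the label slice includes 'c'
block = df.loc['a':'c', ['one']]
print(block.shape)  # (3, 1)
```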
Adding Rows

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df2])
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """
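The combined frame above keeps the original index labels, so 0 and 1 each appear twice. If unique labels are wanted, one option is the `ignore_index=True` argument of `pd.concat`, sketched here with the same data.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

# ignore_index=True discards the old labels and renumbers rows from 0
combined = pd.concat([df, df2], ignore_index=True)
print(combined.index.tolist())    # [0, 1, 2, 3]
print(combined['Name'].tolist())  # ['zs', 'ls', 'ww', 'zl']
```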
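Deleting Rows
Rows are deleted from a DataFrame by index label. If a label is duplicated, every row with that label is deleted.

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, df2])
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """

    # Delete the rows whose index label is 0
    df = df.drop(0)
    print(df)
    """
      Name  Age
    1   ls    4
    1   zl    8
    """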
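To delete exactly one row when labels are duplicated, one approach (a sketch, not the only option) is to make the labels unique first with `reset_index(drop=True)` and then drop by the new label.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

dup = pd.concat([df, df2])            # index labels: 0, 1, 0, 1
after_dup_drop = dup.drop(0)          # removes both rows labeled 0
print(len(after_dup_drop))            # 2

unique = dup.reset_index(drop=True)   # index labels: 0, 1, 2, 3
after_unique_drop = unique.drop(2)    # removes only the 'ww' row
print(after_unique_drop['Name'].tolist())  # ['zs', 'ls', 'zl']
```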
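Modifying Data in a DataFrame
To change data in a DataFrame, select the part to change and assign new values to it.

    import pandas as pd

    df = pd.DataFrame([['zs', 12], ['ls', 4]], columns=['Name', 'Age'])
    df2 = pd.DataFrame([['ww', 16], ['zl', 8]], columns=['Name', 'Age'])

    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    df = pd.concat([df, df2])
    print(df)
    """
      Name  Age
    0   zs   12
    1   ls    4
    0   ww   16
    1   zl    8
    """

    # Select and assign in one step with loc; chained assignment such as
    # df['Name'][0] = 'Tom' is unreliable. Label 0 occurs twice, so both rows change.
    df.loc[0, 'Name'] = 'Tom'
    print(df)
    """
      Name  Age
    0  Tom   12
    1   ls    4
    0  Tom   16
    1   zl    8
    """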
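The same select-then-assign pattern works with a boolean condition as the row selector. The following sketch uses invented data with unique index labels so that each assignment touches only the intended rows.

```python
import pandas as pd

df = pd.DataFrame([['zs', 12], ['ls', 4], ['ww', 16]],
                  columns=['Name', 'Age'])

# Change a single cell selected by row label and column label
df.loc[0, 'Name'] = 'Tom'

# Change every row that satisfies a condition
df.loc[df['Age'] < 10, 'Age'] = 10

print(df['Name'].tolist())  # ['Tom', 'ls', 'ww']
print(df['Age'].tolist())   # [12, 10, 16]
```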
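Common DataFrame Attributes
No. | Attribute or method | Description |
---|---|---|
1 | axes | Returns a list of the row and column axis labels. |
2 | dtypes | Returns the data type of each column. A single column, being a Series, has dtype instead. |
3 | empty | Returns True if the DataFrame is empty. |
4 | ndim | Returns the number of dimensions of the underlying data: 2 for a DataFrame (1 for a Series). |
5 | size | Returns the number of elements in the underlying data. |
6 | values | Returns the data as an ndarray. |
7 | head() | Returns the first n rows. |
8 | tail() | Returns the last n rows. |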
Example code:

    import pandas as pd

    data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
    df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
    df['score'] = pd.Series([90, 80, 70, 60], index=['s1', 's2', 's3', 's4'])
    # print(df)
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """

    print(df.axes)
    # [Index(['s1', 's2', 's3', 's4'], dtype='object'),
    #  Index(['Name', 'Age', 'score'], dtype='object')]
    print(df['Age'].dtype)  # int64
    print(df.empty)         # False
    print(df.ndim)          # 2
    print(df.size)          # 12
    print(df.values)
    """
    [['Tom' 28 90]
     ['Jack' 34 80]
     ['Steve' 29 70]
     ['Ricky' 42 60]]
    """

    print(df.head(3))  # first three rows of df
    """
         Name  Age  score
    s1    Tom   28     90
    s2   Jack   34     80
    s3  Steve   29     70
    """

    print(df.tail(3))  # last three rows of df
    """
         Name  Age  score
    s2   Jack   34     80
    s3  Steve   29     70
    s4  Ricky   42     60
    """
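As a short complement, the sketch below relates `size` to the `shape` attribute (not listed in the table above) and shows the per-column `dtypes` of the same data.

```python
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['s1', 's2', 's3', 's4'])
df['score'] = [90, 80, 70, 60]

rows, cols = df.shape          # (number of rows, number of columns)
print(rows, cols)              # 4 3
print(df.size == rows * cols)  # True

# dtypes is a Series with one entry per column
print(df.dtypes.index.tolist())  # ['Name', 'Age', 'score']
```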