The pd.DataFrame() function

Author: 不正经 | 2024-02-27 12:51:10

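1 Introduction to DataFrame

A DataFrame represents a table: a spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index. Compared with other DataFrame-like structures you may have used before (such as R's data.frame), row-oriented and column-oriented operations on a DataFrame are roughly symmetric. Under the hood, the data is stored as one or more two-dimensional arrays rather than as a list, a dict, or some other collection of one-dimensional arrays.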

DataFrame([data, index, columns, dtype, copy])	
# Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
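The "dict of Series sharing one index" view can be sketched as follows. The labels and numbers are made up for illustration, and the variable names (first, second, frame) are not from the original article.

```python
import pandas as pd

# Two Series whose labels only partly overlap (hypothetical data).
first = pd.Series({'x': 1.0, 'y': 2.0, 'z': 3.0})
second = pd.Series({'x': 10.0, 'y': 20.0})

# Each Series becomes one column; rows are aligned on the union of the labels.
frame = pd.DataFrame({'first': first, 'second': second})

# Selecting a column returns a Series that carries the shared row index.
col = frame['second']
print(frame)
```

Because 'z' is missing from the second Series, that cell is filled with NaN rather than raising an error.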

2 Creating a DataFrame

import pandas as pd
import numpy as np

Creating from a dict

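data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, np.nan],  # np.nan marks a missing value (NA)
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
# The dict keys become the column labels and each list supplies that column's values.
# Without an index argument a default integer index is generated;
# pass e.g. index=['a', 'b', 'c', 'd', 'e'] or index=range(5) to set it explicitly.
pd.DataFrame(data)

Output: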


    state	year	pop
0	Ohio	2000.0	1.5
1	Ohio	2001.0	1.7
2	Ohio	2002.0	3.6
3	Nevada	2001.0	2.4
4	Nevada	NaN	2.9
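The year column prints as 2000.0 because np.nan is a float, so the whole column is stored as float64. A small sketch of this, assuming a pandas version that provides the nullable 'Int64' extension dtype (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002, 2001, np.nan]})

# A single NaN forces the integer-looking column to float64.
float_dtype = frame['year'].dtype

# Assumption: the nullable 'Int64' dtype is available; it keeps integers
# and represents the missing entry as <NA>.
nullable = frame['year'].astype('Int64')
missing_count = int(nullable.isna().sum())
print(float_dtype, nullable.dtype, missing_count)
```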

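Creating a DataFrame with pd.DataFrame.from_dict

# two-level nested dict
d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}
df_index = pd.DataFrame.from_dict(d, orient='index')
df_index

With orient='index', the outer keys (a, b, c) become the row labels and the inner keys (tp, fp) become the column labels.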

df_columns = pd.DataFrame.from_dict(d, orient='columns')
df_columns

Output:

	a	b	c
fp	112	91	74
tp	26	26	23
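The two orient values produce transposed layouts of the same data. The sketch below checks this with label-based lookups, which do not depend on the order in which rows or columns are displayed (that order can differ between pandas versions). The helper ordered is hypothetical, not part of pandas.

```python
import pandas as pd

d = {'a': {'tp': 26, 'fp': 112},
     'b': {'tp': 26, 'fp': 91},
     'c': {'tp': 23, 'fp': 74}}

by_index = pd.DataFrame.from_dict(d, orient='index')      # rows: a, b, c
by_columns = pd.DataFrame.from_dict(d, orient='columns')  # columns: a, b, c

# The same cell, addressed as (row, column) in each layout.
fp_of_b_rowwise = by_index.loc['b', 'fp']
fp_of_b_colwise = by_columns.loc['fp', 'b']


def ordered(frame):
    """Hypothetical helper: sort both axes so layouts can be compared."""
    return frame.sort_index(axis=0).sort_index(axis=1)


# Transposing one result gives the other, up to row/column order.
same_table = ordered(by_index.T).equals(ordered(by_columns))
print(fp_of_b_rowwise, fp_of_b_colwise, same_table)
```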

Creating a DataFrame from a NumPy array, with row labels and column labels

data = pd.DataFrame(np.arange(10, 26).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Output:

	one	two	three	four
Ohio	10	11	12	13
Colorado	14	15	16	17
Utah	18	19	20	21
New York	22	23	24	25

Creating a df with a date index (used in the examples that follow)

np.random.seed(10)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Output:

	A	B	C	D
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-03	0.004291	-0.174600	0.433026	1.203037
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-06	-1.977728	-1.743372	0.266070	2.384967

3 Basic DataFrame attributes

DataFrame.index: The index (row labels) of the DataFrame.

df.index

Output:

DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
               '2019-01-05', '2019-01-06'],
              dtype='datetime64[ns]', freq='D')

Set the name of the index:

df.index.name = 'time'

DataFrame.columns: The column labels of the DataFrame.

df.columns

Output:

Index(['A', 'B', 'C', 'D'], dtype='object')

Set the name of the column index:

df.columns.name = 'alphabet'

DataFrame.values: Return a NumPy representation of the DataFrame.
View the underlying NumPy data (the output below covers the last three rows of df):

df.tail(3).values

array([[-0.96506567,  1.02827408,  0.22863013,  0.44513761],
       [-1.13660221,  0.13513688,  1.484537  , -1.07980489],
       [-1.97772828, -1.7433723 ,  0.26607016,  2.38496733]])
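The pandas documentation recommends DataFrame.to_numpy() over .values for new code. A minimal sketch on made-up frames, showing that both give the same array here and that mixed column types fall back to an object array:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# For an all-float frame, to_numpy() and .values hold the same data.
arr = frame.to_numpy()

# With mixed column types there is no common numeric dtype,
# so the result is an object array.
mixed = pd.DataFrame({'x': [1, 2], 'label': ['a', 'b']}).to_numpy()
print(arr.dtype, mixed.dtype)
```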

4 DataFrame indexing

DataFrame.head(self[, n]): Return the first n rows.

df.head(3)  # show the first three rows

	A	B	C	D
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-03	0.004291	-0.174600	0.433026	1.203037

DataFrame.tail(self[, n]): Return the last n rows.

df.tail(3)  # show the last three rows

	A	B	C	D
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-06	-1.977728	-1.743372	0.266070	2.384967

DataFrame.set_index(self, keys[, drop, …]): Set the DataFrame index using existing columns.

df = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                   'c': ['one', 'one', 'one', 'two', 'two',
                         'two', 'two'],
                   'd': [0, 1, 2, 0, 1, 2, 3]})
df

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

# set_index turns one or more columns of the DataFrame into the row index
df2 = df.set_index(['c', 'd'])
df2

		a	b
 c	d		
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

By default drop=True, which removes the columns that were moved into the index. With drop=False they are also kept as regular columns:

df.set_index(['c', 'd'], drop=False)

		a	b	c	d
c	d				
one	0	0	7	one	0
	1	1	6	one	1
	2	2	5	one	2	
two	0	3	4	two	0
	1	4	3	two	1
	2	5	2	two	2
	3	6	1	two	3
9

reset_index does the opposite of set_index: the levels of the hierarchical index are moved back into columns.

df2.reset_index()

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1
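A hedged sketch of the round trip: set_index followed by reset_index restores the original values (with the former index levels placed first), and the level argument moves only one level back into the columns. Variable names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
indexed = frame.set_index(['c', 'd'])

# Move every index level back into columns; c and d now come first.
restored = indexed.reset_index()

# Move only level d back; c stays as the (now single-level) index.
partial = indexed.reset_index(level='d')
print(restored.columns.tolist(), partial.columns.tolist())
```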

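5 Computation and descriptive statistics

The examples in this section use the date-indexed df created in section 2, not the small frame from the set_index example.

DataFrame.round(self[, decimals]): Round a DataFrame to a variable number of decimal places.
Round the values to two decimal places:

df.round(2)

	A	B	C	D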
2019-01-01	1.33	0.72	-1.55	-0.01
2019-01-02	0.62	-0.72	0.27	0.11
2019-01-03	0.00	-0.17	0.43	1.20
2019-01-04	-0.97	1.03	0.23	0.45
2019-01-05	-1.14	0.14	1.48	-1.08
2019-01-06	-1.98	-1.74	0.27	2.38

Specify a different number of decimal places for different columns:

df.round({'A': 1, 'C': 2})

	A	B	C	D
2019-01-01	 1.3	 0.715279	-1.55	-0.008384
2019-01-02	 0.6	-0.720086	0.27	0.108549
2019-01-03	 0.0	-0.174600	0.43	1.203037
2019-01-04	-1.0	 1.028274	0.23	0.445138
2019-01-05	-1.1	 0.135137	1.48	-1.079805
2019-01-06	-2.0	-1.743372	0.27	2.384967
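Two properties of round are easy to miss: it returns a new DataFrame instead of modifying the original, and columns not named in the dict are left untouched. A minimal sketch with two rows copied from the table above (the names frame and rounded are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'A': [1.331587, -0.965066],
                      'B': [0.715279, 1.028274]})

# Only column A is listed, so only A is rounded.
rounded = frame.round({'A': 1})
print(rounded)
print(frame)  # the original frame keeps its full precision
```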

DataFrame.describe(self[, percentiles, …]) Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

# quick summary statistics for the numeric columns
# (the output below covers the last three rows of df)
df.tail(3).describe()

alphabet	A	B	C	D
count	3.000000	3.000000	3.000000	3.000000
mean	-1.359799	-0.193320	0.659746	0.583433
std	0.541972	1.414715	0.714535	1.736521
min	-1.977728	-1.743372	0.228630	-1.079805
25%	-1.557165	-0.804118	0.247350	-0.317334
50%	-1.136602	0.135137	0.266070	0.445138
75%	-1.050834	0.581705	0.875304	1.415052
max	-0.965066	1.028274	1.484537	2.384967
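The "excluding NaN values" part of the description can be seen in the count row, and the percentiles argument controls which quantile rows are reported. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, 2.0, 3.0, np.nan],
                      'y': [10.0, 20.0, 30.0, 40.0]})

# Ask for the 10th and 90th percentiles instead of the default quartiles.
summary = frame.describe(percentiles=[0.1, 0.9])

# count only counts non-missing entries, so x has one fewer than y.
print(summary.loc['count'])
print(summary.index.tolist())
```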

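DataFrame.apply(self, func[, axis, …]): Apply a function along an axis of the DataFrame.

The df used here has been modified compared with the one above: the first-row values of A and B are 0, column D is set to 5, and a column F has been added.

df

	A	B	C	D	F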
2019-01-01	0.000000	0.000000	-1.545400	5	NaN
2019-01-02	0.621336	-0.720086	0.265512	5	1.0
2019-01-03	0.004291	-0.174600	0.433026	5	2.0
2019-01-04	-0.965066	1.028274	0.228630	5	3.0
2019-01-05	-1.136602	0.135137	1.484537	5	4.0
2019-01-06	-1.977728	-1.743372	0.266070	5	5.0

df.apply(np.cumsum, axis=0, result_type=None)  # cumulative sum down each column (axis=0)

	A	B	C	D	F
2019-01-01	0.000000	0.000000	-1.545400	5	NaN
2019-01-02	0.621336	-0.720086	-1.279889	10	1.0
2019-01-03	0.625627	-0.894686	-0.846863	15	3.0
2019-01-04	-0.339438	0.133588	-0.618232	20	6.0
2019-01-05	-1.476040	0.268725	0.866305	25	10.0
2019-01-06	-3.453769	-1.474647	1.132375	30	15.0

df.apply(lambda x: x.max() - x.min())  # range (max minus min) of each column
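The axis argument decides what the function receives: with the default axis=0 it is called once per column, with axis=1 once per row. A sketch on a small made-up frame, so the results are easy to check by hand:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 30.0, 20.0]})

# Default axis=0: x is a whole column, so the result has one value per column.
col_range = frame.apply(lambda x: x.max() - x.min())

# axis=1: x is a whole row, so the result has one value per row.
row_range = frame.apply(lambda x: x.max() - x.min(), axis=1)

# A function that returns a Series of the same length keeps the frame's shape.
running = frame.apply(np.cumsum)
print(col_range)
print(row_range)
print(running)
```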

6 Reindexing, selection, and label manipulation

DataFrame.rename(self[, mapper, index, …]): Alter axes labels.
Rename a column:

df.rename(columns={'A': 'key2'}, inplace=False)  # returns a renamed copy; df itself is unchanged
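The difference between inplace=False and inplace=True can be sketched as follows, on a made-up frame. The new labels key2 and key3 are just placeholders.

```python
import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# inplace=False (the default): a new DataFrame is returned, frame is untouched.
renamed = frame.rename(columns={'A': 'key2'})
columns_after_copy = list(frame.columns)

# inplace=True: frame itself is modified and the call returns None.
result = frame.rename(columns={'B': 'key3'}, inplace=True)
print(list(renamed.columns), columns_after_copy, list(frame.columns), result)
```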

7 Sorting

DataFrame.sort_index(self[, axis, level, …]): Sort object by labels (along an axis).

# Defaults: axis=0 sorts the rows by their index labels; ascending=True sorts in ascending order.
# Here ascending=False puts the latest date first.
df.sort_index(axis=0, ascending=False)
# df.sort_index(axis=0, ascending=True)

	A	B	C	D
2019-01-06	-1.977728	-1.743372	0.266070	2.384967
2019-01-05	-1.136602	0.135137	1.484537	-1.079805
2019-01-04	-0.965066	1.028274	0.228630	0.445138
2019-01-03	0.004291	-0.174600	0.433026	1.203037
2019-01-02	0.621336	-0.720086	0.265512	0.108549
2019-01-01	1.331587	0.715279	-1.545400	-0.008384
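sort_index works on either axis and returns a new, sorted DataFrame. A sketch on a small made-up frame whose row and column labels are deliberately out of order:

```python
import pandas as pd

frame = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                     index=['b', 'a'], columns=['C', 'A', 'B'])

# axis=0 (default): reorder the rows by their labels.
by_rows = frame.sort_index()

# axis=1: reorder the columns by their labels, here in descending order.
by_cols_desc = frame.sort_index(axis=1, ascending=False)
print(by_rows)
print(by_cols_desc)
```

The values move together with their labels; the original frame keeps its order.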

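Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】; copyright belongs to the original author, and this site assumes no corresponding legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/不正经/article/detail/153076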