
Pandas Official Documentation in Chinese: Basic Usage, Part 3

Author: Monodyee | 2024-03-05 20:30:58


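pandas Chinese documentation

Reposted with permission from Python大咖谈

Further reposting is prohibited

Hi everyone, I'm Laobiao (老表)

Reading time: about 20 minutes

Dainiao (呆鸟) says: "If you find this useful, please tap 'Wow', haha"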

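Function application

Whether you are applying your own function or one from another library to pandas objects, you will use one of the methods below. Which one fits depends on whether the function operates on an entire DataFrame or Series, on rows or columns, or on individual elements.

Tablewise function application: `pipe()`
Row- or column-wise function application: `apply()`
Aggregation API: `agg()` and `transform()`
Elementwise function application: `applymap()`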

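Tablewise function application

DataFrames and Series can of course be passed directly into functions. When functions need to be called in a chain, however, consider using the pipe() method. Compare the following two approaches:

# f, g, and h are functions taking and returning ``DataFrames``
>>> f(g(h(df), arg1=1), arg2=2, arg3=3)

The following is equivalent to the code above:

>>> (df.pipe(h)
...    .pipe(g, arg1=1)
...    .pipe(f, arg2=2, arg3=3))

pandas encourages the second style, known as method chaining. pipe makes it easy to call your own functions or those of another library inside a method chain, just like pandas' own methods.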
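As a minimal sketch of the equivalence between nested calls and a pipe chain, the snippet below defines three small stand-in functions. The bodies of `h`, `g`, and `f` are hypothetical and chosen only so the example runs; the original text does not define them.

```python
import pandas as pd


def h(df):
    # Hypothetical step: drop rows that contain missing values.
    return df.dropna()


def g(df, arg1):
    # Hypothetical step: add a constant to every value.
    return df + arg1


def f(df, arg2, arg3):
    # Hypothetical step: scale every value, then shift it.
    return df * arg2 - arg3


df = pd.DataFrame({"x": [1.0, None, 3.0]})

# Nested style: read from the inside out.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# Chained style: read from top to bottom.
chained = (df.pipe(h)
             .pipe(g, arg1=1)
             .pipe(f, arg2=2, arg3=3))

print(nested.equals(chained))
```

Both forms run the same three steps in the same order; the chained form simply lists them in execution order.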

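In the example above, the functions f, g, and h each expect the DataFrame as their first positional argument. What if the function you want to apply takes its data as, say, the second argument? In that case, pass pipe a tuple of (callable, data_keyword). .pipe then routes the DataFrame to the argument named in the tuple.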
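The tuple form can be tried without any third-party modelling library. In this sketch, `scale` is a hypothetical function whose DataFrame parameter is named `data` and comes second; the tuple `(scale, "data")` tells pipe which keyword should receive the DataFrame.

```python
import pandas as pd


def scale(factor, data):
    # Hypothetical function: the DataFrame is the second parameter, named ``data``.
    return data * factor


df = pd.DataFrame({"x": [1, 2, 3]})

# pipe passes df as the keyword argument ``data``; 10 becomes ``factor``.
result = df.pipe((scale, "data"), 10)
print(result)
```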

The example below fits a regression using statsmodels. Its API expects a formula first and a DataFrame as the second argument, data. We pass the function and keyword pair (sm.ols, 'data') to pipe:

In [138]: import statsmodels.formula.api as sm

In [139]: bb = pd.read_csv('data/baseball.csv', index_col='id')

In [140]: (bb.query('h > 0')
   .....:    .assign(ln_h=lambda df: np.log(df.h))
   .....:    .pipe((sm.ols, 'data'), 'hr ~ ln_h + year + g + C(lg)')
   .....:    .fit()
   .....:    .summary()
   .....:  )
   .....:
Out[140]:
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results
==============================================================================
Dep. Variable:                     hr   R-squared:                       0.685
Model:                            OLS   Adj. R-squared:                  0.665
Method:                 Least Squares   F-statistic:                     34.28
Date:                Thu, 22 Aug 2019   Prob (F-statistic):           3.48e-15
Time:                        15:48:59   Log-Likelihood:                -205.92
No. Observations:                  68   AIC:                             421.8
Df Residuals:                      63   BIC:                             432.9
Df Model:                           4
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept   -8484.7720   4664.146     -1.819      0.074   -1.78e+04     835.780
C(lg)[T.NL]    -2.2736      1.325     -1.716      0.091      -4.922       0.375
ln_h           -1.3542      0.875     -1.547      0.127      -3.103       0.395
year            4.2277      2.324      1.819      0.074      -0.417       8.872
g               0.1841      0.029      6.258      0.000       0.125       0.243
==============================================================================
Omnibus:                       10.875   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.004   Jarque-Bera (JB):               17.298
Skew:                           0.537   Prob(JB):                     0.000175
Kurtosis:                       5.225   Cond. No.                     1.49e+07
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

The pipe method is inspired by Unix pipes and, more recently, by dplyr and magrittr, which introduced the popular (%>%) operator (read as "pipe") for R. The implementation of pipe here is quite clean and feels right at home in Python. We strongly encourage you to read the source code of pipe().

Row- or column-wise function application

The apply() method applies an arbitrary function along an axis of a DataFrame. Like the descriptive statistics methods, it takes an optional axis argument:

In [141]: df.apply(np.mean)
Out[141]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [142]: df.apply(np.mean, axis=1)
Out[142]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

In [143]: df.apply(lambda x: x.max() - x.min())
Out[143]:
one      1.051928
two      1.632779
three    1.840607
dtype: float64

In [144]: df.apply(np.cumsum)
Out[144]:
        one       two     three
a  1.394981  1.772517       NaN
b  1.738035  3.684640 -0.050390
c  2.433281  5.163008  1.177045
d       NaN  5.442353  0.563873

In [145]: df.apply(np.exp)
Out[145]:
        one       two     three
a  4.034899  5.885648       NaN
b  1.409244  6.767440  0.950858
c  2.004201  4.385785  3.412466
d       NaN  1.322262  0.541630

The apply() method also accepts the name of a method as a string:

In [146]: df.apply('mean')
Out[146]:
one      0.811094
two      1.360588
three    0.187958
dtype: float64

In [147]: df.apply('mean', axis=1)
Out[147]:
a    1.583749
b    0.734929
c    1.133683
d   -0.166914
dtype: float64

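By default, the return type of the function passed to apply() determines the type of the final output of DataFrame.apply:

If the applied function returns a Series, the final output is a DataFrame. Its columns match the index of the Series returned by the applied function.
If the applied function returns any other type, the final output is a Series.

This default behavior can be overridden with result_type, which accepts three options: reduce, broadcast, and expand. These options determine whether list-like return values are expanded into a DataFrame.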
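The return-type rules above can be checked on a tiny frame. This is a sketch under assumed data: the two-column DataFrame and the lambdas are made up for illustration, while `axis` and `result_type` are the real `DataFrame.apply` parameters.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The function returns a scalar per row, so the result is a Series.
as_series = df.apply(lambda row: row["a"] + row["b"], axis=1)

# The function returns a Series per row, so the result is a DataFrame
# whose columns are the index of the returned Series.
as_frame = df.apply(
    lambda row: pd.Series({"total": row["a"] + row["b"],
                           "diff": row["b"] - row["a"]}),
    axis=1,
)

# The function returns a list per row; "expand" turns list items into columns.
expanded = df.apply(lambda row: [row["a"], row["a"] * 10],
                    axis=1, result_type="expand")

# "broadcast" keeps the original shape and the original column labels.
broadcast = df.apply(lambda row: [0, 1], axis=1, result_type="broadcast")

print(type(as_series).__name__, list(as_frame.columns),
      list(expanded.columns), list(broadcast.columns))
```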

Used well, apply() can answer many questions about a data set. For example, to extract the date on which each column reaches its maximum value:

In [148]: tsdf = pd.DataFrame(np.random.randn(1000, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=1000))
   .....:

In [149]: tsdf.apply(lambda x: x.idxmax())
Out[149]:
A   2000-08-06
B   2001-01-18
C   2001-07-18
dtype: datetime64[ns]

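You can also pass additional positional and keyword arguments to apply(). For instance, consider the following function that you would like to apply:

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

You can then apply this function as follows: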

df.apply(subtract_and_divide, args=(5,), divide=3)

Another useful feature is the ability to pass Series methods, so that a Series operation is carried out on each column or row:

In [150]: tsdf
Out[150]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

In [151]: tsdf.apply(pd.Series.interpolate)
Out[151]:
                   A         B         C
2000-01-01 -0.158131 -0.232466  0.321604
2000-01-02 -1.810340 -3.105758  0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659  0.092225
2000-01-05 -0.987349 -0.622526  0.321243
2000-01-06 -0.876100 -0.355392  0.550262
2000-01-07 -0.764851 -0.088259  0.779280
2000-01-08 -0.653602  0.178875  1.008298
2000-01-09  1.007996  0.462824  0.254472
2000-01-10  0.307473  0.600337  1.643950

apply() also takes a raw argument, which is False by default; in that case each row or column is converted into a Series before the function is applied. When raw is set to True, the passed function receives an ndarray object instead, which can noticeably improve performance if you do not need the indexing functionality.
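A small sketch makes the effect of raw visible by recording the type each call receives. The helper `col_range` and the list `seen_types` are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

seen_types = []


def col_range(values):
    # Record what apply() hands over, then compute max minus min.
    seen_types.append(type(values))
    return values.max() - values.min()


with_series = df.apply(col_range)             # raw=False: receives a Series
with_ndarray = df.apply(col_range, raw=True)  # raw=True: receives an ndarray

print(seen_types[0].__name__, seen_types[-1].__name__)
print(with_ndarray)
```

The function only uses `max()` and `min()`, which both Series and ndarray provide, so it works in either mode. A function that relies on labels, such as `idxmax()`, would need the default `raw=False`.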

Aggregation API

New in version 0.20.0.

The aggregation API expresses possibly multiple aggregation operations in a single, concise way. The API is similar across pandas objects; see the groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or its alias DataFrame.agg().

We will use a DataFrame similar to the one in the earlier examples:

In [152]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [153]: tsdf.iloc[3:7] = np.nan

In [154]: tsdf
Out[154]:
                   A         B         C
2000-01-01  1.257606  1.004194  0.167574
2000-01-02 -0.749892  0.288112 -0.757304
2000-01-03 -0.207550 -0.298599  0.116018
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.814347 -0.257623  0.869226
2000-01-09 -0.250663 -1.206601  0.896839
2000-01-10  2.169758 -1.333363  0.283157

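Using a single function is equivalent to apply(). You can also pass the name of an aggregation method as a string. These calls return a Series of the aggregated output:

In [155]: tsdf.agg(np.sum)
Out[155]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

In [156]: tsdf.agg('sum')
Out[156]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

# these are equivalent to a ``.sum()`` because we are aggregating
# on a single function
In [157]: tsdf.sum()
Out[157]:
A    3.033606
B   -1.803879
C    1.575510
dtype: float64

A single aggregation on a Series returns a scalar value:

In [158]: tsdf.A.agg('sum')
Out[158]: 3.033606102414146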

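Aggregating with multiple functions

You can pass multiple aggregation functions as a list. The result of each function becomes a row in the resulting DataFrame, and each row is named after its aggregation function.

In [159]: tsdf.agg(['sum'])
Out[159]:
            A         B        C
sum  3.033606 -1.803879  1.57551

Multiple functions yield multiple rows:

In [160]: tsdf.agg(['sum', 'mean'])
Out[160]:
             A         B         C
sum   3.033606 -1.803879  1.575510
mean  0.505601 -0.300647  0.262585

On a Series, multiple functions return a Series indexed by the function names:

In [161]: tsdf.A.agg(['sum', 'mean'])
Out[161]:
sum     3.033606
mean    0.505601
Name: A, dtype: float64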

Passing a lambda function yields a row named <lambda>:

In [162]: tsdf.A.agg(['sum', lambda x: x.mean()])
Out[162]:
sum         3.033606
<lambda>    0.505601
Name: A, dtype: float64

Passing a named function yields a row with that function's name:

In [163]: def mymean(x):
   .....:     return x.mean()
   .....:

In [164]: tsdf.A.agg(['sum', mymean])
Out[164]:
sum       3.033606
mymean    0.505601
Name: A, dtype: float64

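Aggregating with a dict

To choose which functions are applied to which columns, pass DataFrame.agg a dictionary that maps each column name to a function name or to a list of function names.

Note: the order of the output is not guaranteed. To make the output order match the input order, use an OrderedDict.

In [165]: tsdf.agg({'A': 'mean', 'B': 'sum'})
Out[165]:
A    0.505601
B   -1.803879
dtype: float64

Passing a list-like for any column produces a DataFrame. The result is a matrix-like output of all the aggregators, with one row per unique function. Functions that were not requested for a particular column show NaN:

In [166]: tsdf.agg({'A': ['mean', 'min'], 'B': 'sum'})
Out[166]:
             A         B
mean  0.505601       NaN
min  -0.749892       NaN
sum        NaN -1.803879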
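The two dict forms can be compared on a small frame with known values. This is a sketch with made-up data; the column names `A` and `B` simply mirror the examples above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# One function per column: the result is a Series indexed by column name.
per_column = df.agg({"A": "mean", "B": "sum"})

# A list for at least one column: the result is a DataFrame with one row
# per distinct function and NaN where a function was not requested.
mixed = df.agg({"A": ["mean", "min"], "B": "sum"})

print(per_column)
print(mixed)
```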

Mixed dtypes

When a DataFrame contains mixed dtypes that cannot all be aggregated, .agg only takes the aggregations that are valid. This is similar to how groupby .agg works:

In [167]: mdf = pd.DataFrame({'A': [1, 2, 3],
   .....:                     'B': [1., 2., 3.],
   .....:                     'C': ['foo', 'bar', 'baz'],
   .....:                     'D': pd.date_range('20130101', periods=3)})
   .....:

In [168]: mdf.dtypes
Out[168]:
A             int64
B           float64
C            object
D    datetime64[ns]
dtype: object

In [169]: mdf.agg(['min', 'sum'])
Out[169]:
     A    B          C          D
min  1  1.0        bar 2013-01-01
sum  6  6.0  foobarbaz        NaT

Custom describe

With .agg() it is easy to build a custom describe function, similar to the built-in describe function.

In [170]: from functools import partial

In [171]: q_25 = partial(pd.Series.quantile, q=0.25)

In [172]: q_25.__name__ = '25%'

In [173]: q_75 = partial(pd.Series.quantile, q=0.75)

In [174]: q_75.__name__ = '75%'

In [175]: tsdf.agg(['count', 'mean', 'std', 'min', q_25, 'median', q_75, 'max'])
Out[175]:
               A         B         C
count   6.000000  6.000000  6.000000
mean    0.505601 -0.300647  0.262585
std     1.103362  0.887508  0.606860
min    -0.749892 -1.333363 -0.757304
25%    -0.239885 -0.979600  0.128907
median  0.303398 -0.278111  0.225365
75%     1.146791  0.151678  0.722709
max     2.169758  1.004194  0.896839
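An alternative sketch of the same idea uses ordinary named functions instead of `partial` objects with a patched `__name__`. The helpers `q25` and `q75` are hypothetical names; this is one possible variation, not the approach the original text prescribes.

```python
import pandas as pd


def q25(s):
    # Hypothetical helper: first quartile of a Series.
    return s.quantile(0.25)


def q75(s):
    # Hypothetical helper: third quartile of a Series.
    return s.quantile(0.75)


df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "B": [10.0, 20.0, 30.0, 40.0, 50.0]})

# Rows are labelled with the string names and the function names.
summary = df.agg(["count", "mean", q25, "median", q75])
print(summary)
```

The row labels here are `q25` and `q75` rather than `25%` and `75%`, since a function defined with `def` takes its label from its own name.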

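Transform API

New in version 0.20.0.

The transform() method returns an object that has the same index, and therefore the same size, as the original. This API lets you provide multiple operations at once rather than one by one, and it is very similar to the .agg API.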
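The contrast with aggregation is easiest to see by looking at output shapes. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, -6.0]})

reduced = df.agg("sum")           # one value per column
same_shape = df.transform("abs")  # one value per original cell

print(reduced.shape, same_shape.shape)
```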

We first create a DataFrame:

In [176]: tsdf = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'],
   .....:                     index=pd.date_range('1/1/2000', periods=10))
   .....:

In [177]: tsdf.iloc[3:7] = np.nan

In [178]: tsdf
Out[178]:
                   A         B         C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731  1.338144 -1.279321
2000-01-03 -1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -1.240447 -0.201052
2000-01-09 -0.157795  0.791197 -1.144209
2000-01-10 -0.030876  0.371900  0.061932

Here the entire DataFrame is transformed. .transform() accepts NumPy functions, function names given as strings, and user-defined functions.

In [179]: tsdf.transform(np.abs)
Out[179]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [180]: tsdf.transform('abs')
Out[180]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

In [181]: tsdf.transform(lambda x: x.abs())
Out[181]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Here transform() received a single function; this is equivalent to applying a ufunc.

In [182]: np.abs(tsdf)
Out[182]:
                   A         B         C
2000-01-01  0.428759  0.864890  0.675341
2000-01-02  0.168731  1.338144  1.279321
2000-01-03  1.621034  0.438107  0.903794
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374  1.240447  0.201052
2000-01-09  0.157795  0.791197  1.144209
2000-01-10  0.030876  0.371900  0.061932

Passing a single function to .transform() on a Series returns a single Series.

In [183]: tsdf.A.transform(np.abs)
Out[183]:
2000-01-01    0.428759
2000-01-02    0.168731
2000-01-03    1.621034
2000-01-04         NaN
2000-01-05         NaN
2000-01-06         NaN
2000-01-07         NaN
2000-01-08    0.254374
2000-01-09    0.157795
2000-01-10    0.030876
Freq: D, Name: A, dtype: float64

Transform with multiple functions

Passing multiple functions to transform() yields a DataFrame with MultiIndex columns. The first level holds the original column names; the second level holds the names of the transforming functions.

In [184]: tsdf.transform([np.abs, lambda x: x + 1])
Out[184]:
                   A                   B                   C
            absolute  <lambda>  absolute  <lambda>  absolute  <lambda>
2000-01-01  0.428759  0.571241  0.864890  0.135110  0.675341  0.324659
2000-01-02  0.168731  0.831269  1.338144  2.338144  1.279321 -0.279321
2000-01-03  1.621034 -0.621034  0.438107  1.438107  0.903794  1.903794
2000-01-04       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN       NaN       NaN
2000-01-08  0.254374  1.254374  1.240447 -0.240447  0.201052  0.798948
2000-01-09  0.157795  0.842205  0.791197  1.791197  1.144209 -0.144209
2000-01-10  0.030876  0.969124  0.371900  1.371900  0.061932  1.061932
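The two-level column layout can be inspected directly. In this sketch, `plus_one` is a hypothetical named function used in place of the lambda so that the second level gets a readable label.

```python
import pandas as pd


def plus_one(x):
    # Hypothetical transforming function: add one to every value.
    return x + 1


df = pd.DataFrame({"A": [-1.0, 2.0]})

wide = df.transform(["abs", plus_one])

# First level: original column name. Second level: function name.
print(list(wide.columns))
print(wide[("A", "plus_one")].tolist())
```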

Passing multiple functions to a Series yields a DataFrame whose column names are the names of the transforming functions.

In [185]: tsdf.A.transform([np.abs, lambda x: x + 1])
Out[185]:
            absolute  <lambda>
2000-01-01  0.428759  0.571241
2000-01-02  0.168731  0.831269
2000-01-03  1.621034 -0.621034
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374  1.254374
2000-01-09  0.157795  0.842205
2000-01-10  0.030876  0.969124

Using a dict with `transform`

Passing a dict of functions lets you choose the transform applied to each column.

In [186]: tsdf.transform({'A': np.abs, 'B': lambda x: x + 1})
Out[186]:
                   A         B
2000-01-01  0.428759  0.135110
2000-01-02  0.168731  2.338144
2000-01-03  1.621034  1.438107
2000-01-04       NaN       NaN
2000-01-05       NaN       NaN
2000-01-06       NaN       NaN
2000-01-07       NaN       NaN
2000-01-08  0.254374 -0.240447
2000-01-09  0.157795  1.791197
2000-01-10  0.030876  1.371900

Passing a dict of lists produces a DataFrame with MultiIndex columns, labelled with the names of the transforming functions.

In [187]: tsdf.transform({'A': np.abs, 'B': [lambda x: x + 1, 'sqrt']})
Out[187]:
                   A         B
            absolute  <lambda>      sqrt
2000-01-01  0.428759  0.135110       NaN
2000-01-02  0.168731  2.338144  1.156782
2000-01-03  1.621034  1.438107  0.661897
2000-01-04       NaN       NaN       NaN
2000-01-05       NaN       NaN       NaN
2000-01-06       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN
2000-01-08  0.254374 -0.240447       NaN
2000-01-09  0.157795  1.791197  0.889493
2000-01-10  0.030876  1.371900  0.609836

Elementwise function application

Not all functions can be vectorized, that is, written to accept a NumPy array and return another array or a value. For those cases, applymap() on DataFrame and map() on Series accept any Python function that takes a single value and returns a single value.

For example:

In [188]: df4
Out[188]:
        one       two     three
a  1.394981  1.772517       NaN
b  0.343054  1.912123 -0.050390
c  0.695246  1.478369  1.227435
d       NaN  0.279344 -0.613172

In [189]: def f(x):
   .....:     return len(str(x))
   .....:

In [190]: df4['one'].map(f)
Out[190]:
a    18
b    19
c    18
d     3
Name: one, dtype: int64

In [191]: df4.applymap(f)
Out[191]:
   one  two  three
a   18   17      3
b   19   18     20
c   18   18     16
d    3   19     19
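The sketch below applies a plain Python function, one that branches on a single value and so is not vectorized, to a Series and to a whole DataFrame. It reaches every element by combining `apply()` with `Series.map()` rather than calling `applymap()`, because pandas 2.1 and later deprecate `DataFrame.applymap()` in favor of `DataFrame.map()`; the `label` function and the data are made up for illustration.

```python
import pandas as pd


def label(x):
    # Plain Python branching on a single value; not vectorized.
    return "neg" if x < 0 else "non-neg"


df = pd.DataFrame({"one": [1.5, -0.5], "two": [-2.0, 3.0]})

# Elementwise on one column.
by_column = df["one"].map(label)

# Elementwise on every column: map each column Series in turn.
whole_frame = df.apply(lambda col: col.map(label))

print(whole_frame)
```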

Series.map() has an additional feature: it can "link" or "map" values defined by a second Series. This is closely related to the merging/joining functionality:

In [192]: s = pd.Series(['six', 'seven', 'six', 'seven', 'six'],
   .....:               index=['a', 'b', 'c', 'd', 'e'])
   .....:

In [193]: t = pd.Series({'six': 6., 'seven': 7.})

In [194]: s
Out[194]:
a      six
b    seven
c      six
d    seven
e      six
dtype: object

In [195]: s.map(t)
Out[195]:
a    6.0
b    7.0
c    6.0
d    7.0
e    6.0
dtype: float64
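One detail worth seeing in a sketch is what happens to values that have no match in the lookup Series: they come back as NaN, much like unmatched keys in a left join. The data below is made up, and `lookup` is a hypothetical name.

```python
import pandas as pd

s = pd.Series(["six", "seven", "eight"], index=["a", "b", "c"])
lookup = pd.Series({"six": 6.0, "seven": 7.0})

# Each value of s is looked up in the index of ``lookup``.
mapped = s.map(lookup)

# A plain dict works as the mapping as well.
mapped_from_dict = s.map({"six": 6.0, "seven": 7.0})

print(mapped)
```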