
Python for Data Analysis (5)

Chapter 13: Introduction to Modeling Libraries in Python

13.1 Interfacing Between pandas and Model Code

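  • Feature engineering is any data transformation or analysis that extracts information from a raw dataset that may be useful in a modeling context.
# The point of contact between pandas and other analysis libraries is usually NumPy arrays.
# To convert a DataFrame to a NumPy array, use the .values attribute
import pandas as pd
import numpy as np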
data = pd.DataFrame({'x0':[1,2,3,4,5],
                    'x1':[0.01,-0.01,0.25,-4.1,0],
                    'y':[-1.5,0.,3.6,1.3,-2]})
data
   x0    x1    y
0   1  0.01 -1.5
1   2 -0.01  0.0
2   3  0.25  3.6
3   4 -4.10  1.3
4   5  0.00 -2.0
data.columns
Index(['x0', 'x1', 'y'], dtype='object')
data.values
array([[ 1.  ,  0.01, -1.5 ],
       [ 2.  , -0.01,  0.  ],
       [ 3.  ,  0.25,  3.6 ],
       [ 4.  , -4.1 ,  1.3 ],
       [ 5.  ,  0.  , -2.  ]])
# Convert the array back to a DataFrame
df2 = pd.DataFrame(data.values,columns=['one','two','three'])
df2
   one   two  three
0  1.0  0.01   -1.5
1  2.0 -0.01    0.0
2  3.0  0.25    3.6
3  4.0 -4.10    1.3
4  5.0  0.00   -2.0
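# The .values attribute is intended for homogeneous data (for example, all numeric types);
# with heterogeneous data the result is an ndarray of Python objects
df3 = data.copy()
df3['strings'] = ['a','b','c','d','e']
df3
   x0    x1    y strings
0   1  0.01 -1.5       a
1   2 -0.01  0.0       b
2   3  0.25  3.6       c
3   4 -4.10  1.3       d
4   5  0.00 -2.0       e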
df3.values
array([[1, 0.01, -1.5, 'a'],
       [2, -0.01, 0.0, 'b'],
       [3, 0.25, 3.6, 'c'],
       [4, -4.1, 1.3, 'd'],
       [5, 0.0, -2.0, 'e']], dtype=object)
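As a side note, recent pandas versions also offer the to_numpy() method as an alternative to .values. The following is a minimal sketch of how the resulting dtype depends on whether the columns are homogeneous; the frame names numeric and mixed are illustrative and do not appear in the original.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration only
numeric = pd.DataFrame({'x0': [1, 2, 3], 'x1': [0.01, -0.01, 0.25]})
mixed = numeric.copy()
mixed['strings'] = ['a', 'b', 'c']

numeric_arr = numeric.to_numpy()   # ints are upcast, giving a single float64 array
mixed_arr = mixed.to_numpy()       # no common numeric type, so dtype is object

print(numeric_arr.dtype)
print(mixed_arr.dtype)
```

An object array works, but most numerical routines expect a numeric dtype, which is why selecting only the numeric columns first is usually preferable.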
# For some models you may only want to use a subset of the columns. I recommend using loc indexing with values
model_cols = ['x0','x1']
data.loc[:,model_cols].values
array([[ 1.  ,  0.01],
       [ 2.  , -0.01],
       [ 3.  ,  0.25],
       [ 4.  , -4.1 ],
       [ 5.  ,  0.  ]])
data['category'] = pd.Categorical(['a','b','a','a','b'],categories=['a','b'])
data
   x0    x1    y category
0   1  0.01 -1.5        a
1   2 -0.01  0.0        b
2   3  0.25  3.6        a
3   4 -4.10  1.3        a
4   5  0.00 -2.0        b
# To replace the 'category' column with dummy variables, we first create the dummy variables,
# then drop the 'category' column, and finally join the result
dummies = pd.get_dummies(data.category,prefix='category')
dummies
   category_a  category_b
0           1           0
1           0           1
2           1           0
3           1           0
4           0           1
data_with_dummies = data.drop('category',axis=1).join(dummies)
data_with_dummies
   x0    x1    y  category_a  category_b
0   1  0.01 -1.5           1           0
1   2 -0.01  0.0           0           1
2   3  0.25  3.6           1           0
3   4 -4.10  1.3           1           0
4   5  0.00 -2.0           0           1
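The create, drop, and join steps above can also be collapsed into one call, since get_dummies accepts a whole DataFrame together with a columns argument. This is a sketch on a reduced copy of the data, not the author's code; dtype=int is passed because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

data = pd.DataFrame({
    'x0': [1, 2, 3, 4, 5],
    'category': pd.Categorical(['a', 'b', 'a', 'a', 'b'], categories=['a', 'b']),
})

# columns= names the columns to encode; each one is replaced by its indicator columns.
# dtype=int keeps 0/1 output instead of True/False.
encoded = pd.get_dummies(data, columns=['category'], dtype=int)
print(encoded)
```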

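13.2 Creating Model Descriptions with Patsy

  • Patsy (https://patsy.readthedocs.io/) is a Python library for describing statistical models (especially linear models).
  • It uses a small string-based "formula syntax" inspired by the formula syntax of the R and S statistical programming languages.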
data = pd.DataFrame({'x0':[1,2,3,4,5],
                    'x1':[0.01,-0.01,0.25,-4.1,0],
                    'y':[-1.5,0.,3.6,1.3,-2]})
data
   x0    x1    y
0   1  0.01 -1.5
1   2 -0.01  0.0
2   3  0.25  3.6
3   4 -4.10  1.3
4   5  0.00 -2.0
import patsy
y,X = patsy.dmatrices('y~x0+x1',data)
y
DesignMatrix with shape (5, 1)
     y
  -1.5
   0.0
   3.6
   1.3
  -2.0
  Terms:
    'y' (column 0)
X
DesignMatrix with shape (5, 3)
  Intercept  x0     x1
          1   1   0.01
          1   2  -0.01
          1   3   0.25
          1   4  -4.10
          1   5   0.00
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'x1' (column 2)
np.asarray(y)
array([[-1.5],
       [ 0. ],
       [ 3.6],
       [ 1.3],
       [-2. ]])
np.asarray(X)
array([[ 1.  ,  1.  ,  0.01],
       [ 1.  ,  2.  , -0.01],
       [ 1.  ,  3.  ,  0.25],
       [ 1.  ,  4.  , -4.1 ],
       [ 1.  ,  5.  ,  0.  ]])
# Adding the term + 0 to the model suppresses the intercept
patsy.dmatrices('y~x0+x1+0',data)[1]
DesignMatrix with shape (5, 2)
  x0     x1
   1   0.01
   2  -0.01
   3   0.25
   4  -4.10
   5   0.00
  Terms:
    'x0' (column 0)
    'x1' (column 1)
# Ordinary least squares fit; rcond=None avoids a FutureWarning on older NumPy versions
coef,resid,_,_ = np.linalg.lstsq(X,y,rcond=None)
coef
array([[ 0.31290976],
       [-0.07910564],
       [-0.26546384]])
coef = pd.Series(coef.squeeze(),index=X.design_info.column_names)
coef
Intercept    0.312910
x0          -0.079106
x1          -0.265464
dtype: float64
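To make clear what the design matrix contributes, here is a sketch that reproduces the same fit without Patsy by stacking the intercept column by hand. The names X_manual, y_manual, and coef_manual are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x0': [1, 2, 3, 4, 5],
                     'x1': [0.01, -0.01, 0.25, -4.1, 0],
                     'y': [-1.5, 0., 3.6, 1.3, -2]})

# Build the design matrix by hand: a column of ones (the intercept), then x0 and x1
X_manual = np.column_stack([np.ones(len(data)), data['x0'], data['x1']])
y_manual = data['y'].to_numpy()

coef_manual, *_ = np.linalg.lstsq(X_manual, y_manual, rcond=None)
coef_manual = pd.Series(coef_manual, index=['Intercept', 'x0', 'x1'])
print(coef_manual)
```

The coefficients agree with the Patsy-based result, which shows that the formula 'y~x0+x1' amounts to this column layout plus metadata about the terms.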

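13.2.1 Data Transformations in Patsy Formulas

  • You can mix Python code into your Patsy formulas; when evaluating the formula, Patsy will try to find the functions you use in the enclosing scope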
y,X = patsy.dmatrices('y~x0+np.log(np.abs(x1+1))',data)
X
DesignMatrix with shape (5, 3)
  Intercept  x0  np.log(np.abs(x1 + 1))
          1   1                 0.00995
          1   2                -0.01005
          1   3                 0.22314
          1   4                 1.13140
          1   5                 0.00000
  Terms:
    'Intercept' (column 0)
    'x0' (column 1)
    'np.log(np.abs(x1 + 1))' (column 2)
# Some commonly used variable transformations include standardizing (to mean 0 and variance 1) and centering (subtracting the mean)
y,X = patsy.dmatrices('y~standardize(x0)+center(x1)',data)
X
DesignMatrix with shape (5, 3)
  Intercept  standardize(x0)  center(x1)
          1         -1.41421        0.78
          1         -0.70711        0.76
          1          0.00000        1.02
          1          0.70711       -3.33
          1          1.41421        0.77
  Terms:
    'Intercept' (column 0)
    'standardize(x0)' (column 1)
    'center(x1)' (column 2)
# patsy.build_design_matrices can apply the transformations to new out-of-sample data, using the information saved from the original in-sample dataset
new_data = pd.DataFrame({'x0':[6,7,8,9],
                        'x1':[3.1,-0.5,0,2.3],
                        'y':[1,2,3,4]})
new_X = patsy.build_design_matrices([X.design_info],new_data)
new_X
[DesignMatrix with shape (4, 3)
   Intercept  standardize(x0)  center(x1)
           1          2.12132        3.87
           1          2.82843        0.27
           1          3.53553        0.77
           1          4.24264        3.07
   Terms:
     'Intercept' (column 0)
     'standardize(x0)' (column 1)
     'center(x1)' (column 2)]
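The idea behind this stateful behavior can be sketched in plain pandas: statistics are computed once from the in-sample data and then reused on the new data. This is not how Patsy is implemented internally; it only mirrors the arithmetic. The population standard deviation (ddof=0) is an assumption inferred from the standardize(x0) values printed above.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'x0': [1, 2, 3, 4, 5],
                      'x1': [0.01, -0.01, 0.25, -4.1, 0]})
new = pd.DataFrame({'x0': [6, 7, 8, 9],
                    'x1': [3.1, -0.5, 0, 2.3]})

# "Fit" step: remember statistics of the in-sample data only
x0_mean = train['x0'].mean()
x0_std = train['x0'].std(ddof=0)   # population standard deviation (assumed)
x1_mean = train['x1'].mean()

# "Transform" step: reuse the saved statistics on the out-of-sample data
new_x0_standardized = ((new['x0'] - x0_mean) / x0_std).to_numpy()
new_x1_centered = (new['x1'] - x1_mean).to_numpy()

print(new_x0_standardized)
print(new_x1_centered)
```

Recomputing the mean and standard deviation from new_data instead would give different columns, which is the mistake that build_design_matrices helps avoid.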
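# Because + in a Patsy formula does not mean addition, wrap the expression in the special I function to add columns by name
y,X = patsy.dmatrices('y~I(x0+x1)',data)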
X
DesignMatrix with shape (5, 2)
  Intercept  I(x0 + x1)
          1        1.01
          1        1.99
          1        3.25
          1       -0.10
          1        5.00
  Terms:
    'Intercept' (column 0)
    'I(x0 + x1)' (column 1)

13.2.2 Categorical Data and Patsy

data = pd.DataFrame({'key1':['a','a','b','b','a','b','a','b'],
                    'key2':[0,1,0,1,0,1,0,0],
                    'v1':[1,2,3,4,5,6,7,8],
                    'v2':[-1,0,2.5,-0.5,4.0,-1.2,0.2,-1.7]})

y,X = patsy.dmatrices('v2~key1',data)

X
DesignMatrix with shape (8, 2)
  Intercept  key1[T.b]
          1          0
          1          0
          1          1
          1          1
          1          0
          1          1
          1          0
          1          1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
y,X = patsy.dmatrices('v2~key1+0',data)

X
DesignMatrix with shape (8, 2)
  key1[a]  key1[b]
        1        0
        1        0
        0        1
        0        1
        1        0
        0        1
        1        0
        0        1
  Terms:
    'key1' (columns 0:2)
# Numeric columns can be reinterpreted as categorical with the C function
y,X = patsy.dmatrices('v2~C(key2)',data)

X
DesignMatrix with shape (8, 2)
  Intercept  C(key2)[T.1]
          1             0
          1             1
          1             0
          1             1
          1             0
          1             1
          1             0
          1             0
  Terms:
    'Intercept' (column 0)
    'C(key2)' (column 1)
data['key2'] = data['key2'].map({0:'zero',1:'one'})
data
  key1  key2  v1   v2
0    a  zero   1 -1.0
1    a   one   2  0.0
2    b  zero   3  2.5
3    b   one   4 -0.5
4    a  zero   5  4.0
5    b   one   6 -1.2
6    a  zero   7  0.2
7    b  zero   8 -1.7
y,X = patsy.dmatrices('v2~key1+key2',data)

X
DesignMatrix with shape (8, 3)
  Intercept  key1[T.b]  key2[T.zero]
          1          0             1
          1          0             0
          1          1             1
          1          1             0
          1          0             1
          1          1             0
          1          0             1
          1          1             1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)
y,X = patsy.dmatrices('v2~key1+key2+key1:key2',data)

X
DesignMatrix with shape (8, 4)
  Intercept  key1[T.b]  key2[T.zero]  key1[T.b]:key2[T.zero]
          1          0             1                       0
          1          0             0                       0
          1          1             1                       1
          1          1             0                       0
          1          0             1                       0
          1          1             0                       0
          1          0             1                       0
          1          1             1                       1
  Terms:
    'Intercept' (column 0)
    'key1' (column 1)
    'key2' (column 2)
    'key1:key2' (column 3)
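The columns of this design matrix can be rebuilt by hand to see what the interaction term key1:key2 means. The sketch below uses pandas get_dummies with drop_first=True as a stand-in for Patsy's treatment coding; the column name 'key1_b:key2_zero' is made up for the illustration.

```python
import pandas as pd

data = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b'],
                     'key2': ['zero', 'one', 'zero', 'one', 'zero', 'one', 'zero', 'zero']})

# drop_first=True drops the first (alphabetical) level of each column,
# which plays the role of the reference level in treatment coding
dummies = pd.get_dummies(data, drop_first=True, dtype=int)

# The interaction column is the elementwise product of the two indicator columns
dummies['key1_b:key2_zero'] = dummies['key1_b'] * dummies['key2_zero']
print(dummies)
```

The product is 1 only for rows where key1 is 'b' and key2 is 'zero', matching the last column of the Patsy output.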

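13.3 Introduction to statsmodels

  • statsmodels (http://www.statsmodels.org) is a Python library for fitting many kinds of statistical models, performing statistical tests, and exploring and visualizing data. It contains more "classical" frequentist statistical methods, while Bayesian methods and machine learning models are found in other libraries.
  • Some kinds of models found in statsmodels include:
  • Linear models, generalized linear models, and robust linear models
  • Linear mixed effects models
  • Analysis of variance (ANOVA) methods
  • Time series processes and state space models
  • Generalized method of moments

13.3.1 Estimating Linear Models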

import statsmodels.api as sm
import statsmodels.formula.api as smf
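# Helper that draws normally distributed data with a given mean and variance
def dnorm(mean,variance,size=1):
    if isinstance(size,int):
        size = size,
    # The return must sit outside the if block, otherwise tuple sizes return None
    return mean+np.sqrt(variance)*np.random.randn(*size)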
np.random.seed(12345)
N = 100
X = np.c_[dnorm(0,0.4,size=N),
         dnorm(0,0.6,size=N),
         dnorm(0,0.2,size=N)]

eps = dnorm(0,0.1,size=N)
beta = [0.1,0.3,0.5]
y = np.dot(X,beta)+eps
X[:5]
array([[-0.12946849, -1.21275292,  0.50422488],
       [ 0.30291036, -0.43574176, -0.25417986],
       [-0.32852189, -0.02530153,  0.13835097],
       [-0.35147471, -0.71960511, -0.25821463],
       [ 1.2432688 , -0.37379916, -0.52262905]])
y[:5]
array([ 0.42786349, -0.67348041, -0.09087764, -0.48949442, -0.12894109])
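# A linear model is generally fitted with an intercept term, as we saw with Patsy. The sm.add_constant function adds an intercept column to an existing matrix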
X_model = sm.add_constant(X)
X_model[:5]
array([[ 1.        , -0.12946849, -1.21275292,  0.50422488],
       [ 1.        ,  0.30291036, -0.43574176, -0.25417986],
       [ 1.        , -0.32852189, -0.02530153,  0.13835097],
       [ 1.        , -0.35147471, -0.71960511, -0.25821463],
       [ 1.        ,  1.2432688 , -0.37379916, -0.52262905]])
# The sm.OLS class can fit an ordinary least squares linear regression (fitted here on X, without the intercept column)
model = sm.OLS(y,X)
results = model.fit()
results.params
array([0.17826108, 0.22303962, 0.50095093])
print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.430
Model:                            OLS   Adj. R-squared (uncentered):              0.413
Method:                 Least Squares   F-statistic:                              24.42
Date:                Thu, 30 Dec 2021   Prob (F-statistic):                    7.44e-12
Time:                        11:09:06   Log-Likelihood:                         -34.305
No. Observations:                 100   AIC:                                      74.61
Df Residuals:                      97   BIC:                                      82.42
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.1783      0.053      3.364      0.001       0.073       0.283
x2             0.2230      0.046      4.818      0.000       0.131       0.315
x3             0.5010      0.080      6.237      0.000       0.342       0.660
==============================================================================
Omnibus:                        4.662   Durbin-Watson:                   2.201
Prob(Omnibus):                  0.097   Jarque-Bera (JB):                4.098
Skew:                           0.481   Prob(JB):                        0.129
Kurtosis:                       3.243   Cond. No.                         1.74
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
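The coef and std err columns of this summary can be reproduced with a few lines of linear algebra. The sketch below regenerates the same random data and solves the normal equations directly; it is meant as an illustration of the textbook OLS formulas, not of how statsmodels computes its results internally.

```python
import numpy as np

def dnorm(mean, variance, size=1):
    if isinstance(size, int):
        size = size,
    return mean + np.sqrt(variance) * np.random.randn(*size)

np.random.seed(12345)
N = 100
X = np.c_[dnorm(0, 0.4, size=N),
          dnorm(0, 0.6, size=N),
          dnorm(0, 0.2, size=N)]
eps = dnorm(0, 0.1, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps

# Normal equations: solve (X'X) b = X'y for the coefficient vector b
XtX = X.T @ X
params = np.linalg.solve(XtX, X.T @ y)

# Residual variance uses n - k degrees of freedom (k = number of columns)
resid = y - X @ params
sigma2 = resid @ resid / (N - X.shape[1])
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(XtX)))

print(params)
print(std_err)
print(params / std_err)   # t statistics
```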
data = pd.DataFrame(X,columns=['col0','col1','col2'])
data['y'] = y
data[:5]
       col0      col1      col2         y
0 -0.129468 -1.212753  0.504225  0.427863
1  0.302910 -0.435742 -0.254180 -0.673480
2 -0.328522 -0.025302  0.138351 -0.090878
3 -0.351475 -0.719605 -0.258215 -0.489494
4  1.243269 -0.373799 -0.522629 -0.128941
# You can use the statsmodels formula API with Patsy formula strings
results = smf.ols('y~col0+col1+col2',data=data).fit()
results.params
Intercept    0.033559
col0         0.176149
col1         0.224826
col2         0.514808
dtype: float64
results.tvalues
Intercept    0.952188
col0         3.319754
col1         4.850730
col2         6.303971
dtype: float64
results.predict(data[:5])
0   -0.002327
1   -0.141904
2    0.041226
3   -0.323070
4   -0.100535
dtype: float64

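13.3.2 Estimating Time Series Processes

  • Another class of models in statsmodels is for time series analysis.
  • These include autoregressive processes, Kalman filtering and other state space models, and multivariate autoregressive models.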
# Simulate an AR(2) process with noise
init_x = 4
values = [init_x,init_x]
N = 1000
b0 = 0.8
b1 = -0.4
noise = dnorm(0,0.1,N)
for i in range(N):
    new_x = values[-1]*b0+values[-2]*b1+noise[i]
    values.append(new_x)
MAXLAGS = 5
# Note: sm.tsa.AR is deprecated in recent statsmodels releases in favor of sm.tsa.AutoReg
model = sm.tsa.AR(values)
results = model.fit(MAXLAGS)
results.params
array([-0.00616093,  0.78446347, -0.40847891, -0.01364148,  0.01496872,
        0.01429462])
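What an autoregressive fit estimates can be shown with ordinary least squares on lagged copies of the series. This sketch simulates the same AR(2) process (with a different random generator and an arbitrary seed, so the numbers will not match those above) and regresses each value on its two predecessors. It is a simplified stand-in, not the estimator statsmodels uses.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility
N = 1000
b0, b1 = 0.8, -0.4
noise = np.sqrt(0.1) * rng.standard_normal(N)

values = [4.0, 4.0]
for i in range(N):
    values.append(values[-1] * b0 + values[-2] * b1 + noise[i])
values = np.array(values)

# Regress x[t] on a constant, x[t-1], and x[t-2]
target = values[2:]
lagged = np.column_stack([np.ones(len(target)), values[1:-1], values[:-2]])
est, *_ = np.linalg.lstsq(lagged, target, rcond=None)

print(est)   # [intercept, lag-1 coefficient, lag-2 coefficient]
```

The lag-1 and lag-2 estimates should land near 0.8 and -0.4, just as the first two lag coefficients do in the statsmodels result.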

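13.4 Introduction to scikit-learn

  • scikit-learn (http://scikit-learn.org) is one of the most widely used and trusted general-purpose Python machine learning toolkits.
  • It contains a broad selection of standard supervised and unsupervised machine learning methods, with tools for model selection and evaluation, data transformation, data loading, and model persistence.
  • These models can be used for classification, clustering, prediction, and other common tasks.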
train = pd.read_csv('datasets/titanic/train.csv')
test = pd.read_csv('datasets/titanic/test.csv')
train[:4]
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
train.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
test.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
impute_value = train['Age'].median()
train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)
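Note that the fill value is taken from the training set and then applied to both datasets, so that no information from the test set leaks into preprocessing. A small sketch with made-up ages (the miniature frames below are hypothetical) makes the mechanics visible:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature train/test frames
train = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0, 54.0]})
test = pd.DataFrame({'Age': [np.nan, 47.0, np.nan]})

# The median is computed from the training rows only (NaN values are skipped)
impute_value = train['Age'].median()

train['Age'] = train['Age'].fillna(impute_value)
test['Age'] = test['Age'].fillna(impute_value)

print(impute_value)
print(test['Age'].tolist())
```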
train['IsFemale'] = (train['Sex']=='female').astype(int)
test['IsFemale'] = (test['Sex']=='female').astype(int)
train[:5]
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked  IsFemale
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S         0
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C         1
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S         1
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S         1
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S         0
test[:5]
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare  Cabin  Embarked  IsFemale
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292    NaN         Q         0
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000    NaN         S         1
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875    NaN         Q         0
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625    NaN         S         0
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875    NaN         S         1
predictors = ['Pclass','IsFemale','Age']
X_train = train[predictors].values
X_train[:5]
array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])
X_test = test[predictors].values
X_test[:5]
array([[ 3. ,  0. , 34.5],
       [ 3. ,  1. , 47. ],
       [ 2. ,  0. , 62. ],
       [ 3. ,  0. , 27. ],
       [ 3. ,  1. , 22. ]])
y_train = train['Survived'].values
y_train[:5]
array([0, 1, 1, 1, 0], dtype=int64)
# Create a model instance using scikit-learn's LogisticRegression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train,y_train)
LogisticRegression()
y_predict = model.predict(X_test)
y_predict[:10]
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
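from sklearn.linear_model import LogisticRegressionCV
# Cs sets how many regularization strengths to search; recent scikit-learn versions require it as a keyword
model_cv = LogisticRegressionCV(Cs=10)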
model_cv.fit(X_train,y_train)
LogisticRegressionCV()
from sklearn.model_selection import cross_val_score
model = LogisticRegression(C=10)
scores = cross_val_score(model,X_train,y_train,cv=4)
scores
array([0.77578475, 0.79820628, 0.77578475, 0.78828829])
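To show what the four scores represent, here is a NumPy-only sketch of k-fold cross-validation. It uses synthetic data and a toy nearest-centroid classifier (both invented for this example) instead of the Titanic data and logistic regression, and it shuffles rows before splitting, which may differ from the splitting strategy scikit-learn applies by default.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed
# Synthetic two-class data: Gaussian clusters centered at (-2, -2) and (2, 2)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)), rng.normal(2, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

def nearest_centroid_predict(X_fit, y_fit, X_new):
    """Toy classifier: assign each row to the class with the closest mean."""
    centroids = np.array([X_fit[y_fit == label].mean(axis=0) for label in (0, 1)])
    dists = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

k = 4
folds = np.array_split(rng.permutation(len(y)), k)   # shuffled, non-overlapping folds

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = nearest_centroid_predict(X[train_idx], y[train_idx], X[test_idx])
    scores.append((pred == y[test_idx]).mean())   # accuracy on the held-out fold

scores = np.array(scores)
print(scores)
```

Each score is the accuracy on one held-out fold after training on the other three, which is the same quantity cross_val_score reports for a classifier by default.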

13.5 Continuing Your Education

Chapter 14: Data Analysis Examples

14.1 1.USA.gov Data from Bitly

path = 'datasets/bitly_usagov/example.txt'
open(path).readline()
'{ "a": "Mozilla\\/5.0 (Windows NT 6.1; WOW64) AppleWebKit\\/535.11 (KHTML, like Gecko) Chrome\\/17.0.963.78 Safari\\/535.11", "c": "US", "nk": 1, "tz": "America\\/New_York", "gr": "MA", "g": "A6qOVH", "h": "wfLQtf", "l": "orofrog", "al": "en-US,en;q=0.8", "hh": "1.usa.gov", "r": "http:\\/\\/www.facebook.com\\/l\\/7AQEFzjSi\\/1.usa.gov\\/wfLQtf", "u": "http:\\/\\/www.ncbi.nlm.nih.gov\\/pubmed\\/22415991", "t": 1331923247, "hc": 1331822918, "cy": "Danvers", "ll": [ 42.576698, -70.954903 ] }\n'
import json 
path = 'datasets/bitly_usagov/example.txt'
records = [json.loads(line) for line in open(path)]
records[0]
{'a': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
 'c': 'US',
 'nk': 1,
 'tz': 'America/New_York',
 'gr': 'MA',
 'g': 'A6qOVH',
 'h': 'wfLQtf',
 'l': 'orofrog',
 'al': 'en-US,en;q=0.8',
 'hh': '1.usa.gov',
 'r': 'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
 'u': 'http://www.ncbi.nlm.nih.gov/pubmed/22415991',
 't': 1331923247,
 'hc': 1331822918,
 'cy': 'Danvers',
 'll': [42.576698, -70.954903]}

14.1.1 Counting Time Zones in Pure Python

time_zones = [rec['tz'] for rec in records]
# This fails because not every record has a time zone field
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-767e72d0f2fa> in <module>
----> 1 time_zones = [rec['tz'] for rec in records]
      2 # This fails because not every record has a time zone field

<ipython-input-6-767e72d0f2fa> in <listcomp>(.0)
----> 1 time_zones = [rec['tz'] for rec in records]
      2 # This fails because not every record has a time zone field

KeyError: 'tz'
time_zones = [rec['tz'] for rec in records if 'tz' in rec]
# One way to count is to store the counts in a dict while iterating through the time zones
def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts
from collections import defaultdict
def get_counts2(sequence):
    counts = defaultdict(int)  # values will initialize to 0
    for x in sequence:
        counts[x] += 1
    return counts
counts = get_counts(time_zones)
counts['America/New_York']
1251
len(time_zones)
3440
# To get the top 10 time zones and their counts
def top_counts(count_dict,n=10):
    value_key_pairs = [(count,tz) for tz,count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:]
top_counts(counts)
[(33, 'America/Sao_Paulo'),
 (35, 'Europe/Madrid'),
 (36, 'Pacific/Honolulu'),
 (37, 'Asia/Tokyo'),
 (74, 'Europe/London'),
 (191, 'America/Denver'),
 (382, 'America/Los_Angeles'),
 (400, 'America/Chicago'),
 (521, ''),
 (1251, 'America/New_York')]
# If you search the Python standard library, you will find the collections.Counter class, which makes this task much simpler
from collections import Counter
counts = Counter(time_zones)
counts.most_common(10)
[('America/New_York', 1251),
 ('', 521),
 ('America/Chicago', 400),
 ('America/Los_Angeles', 382),
 ('America/Denver', 191),
 ('Europe/London', 74),
 ('Asia/Tokyo', 37),
 ('Pacific/Honolulu', 36),
 ('Europe/Madrid', 35),
 ('America/Sao_Paulo', 33)]
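The three counting approaches are interchangeable, which a short self-contained sketch can confirm. The sample list below is made up and merely stands in for time_zones; dict.get and sorted with a key function are used here as compact variants of the helper functions above.

```python
from collections import Counter, defaultdict

# Hypothetical sample standing in for the time_zones list
time_zones = ['America/New_York', '', 'America/Chicago', 'America/New_York',
              '', 'America/New_York', 'Europe/London']

# Plain dict, using dict.get to supply the starting value of 0
counts_dict = {}
for tz in time_zones:
    counts_dict[tz] = counts_dict.get(tz, 0) + 1

# defaultdict(int) creates missing keys with value 0 automatically
counts_default = defaultdict(int)
for tz in time_zones:
    counts_default[tz] += 1

counts_counter = Counter(time_zones)

# Top two entries, sorted by count in descending order
top_two = sorted(counts_dict.items(), key=lambda kv: kv[1], reverse=True)[:2]
print(top_two)
print(counts_counter.most_common(2))
```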

14.1.2 Counting Time Zones with pandas

import pandas as pd
frame = pd.DataFrame(records)
frame.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   a            3440 non-null   object 
 1   c            2919 non-null   object 
 2   nk           3440 non-null   float64
 3   tz           3440 non-null   object 
 4   gr           2919 non-null   object 
 5   g            3440 non-null   object 
 6   h            3440 non-null   object 
 7   l            3440 non-null   object 
 8   al           3094 non-null   object 
 9   hh           3440 non-null   object 
 10  r            3440 non-null   object 
 11  u            3440 non-null   object 
 12  t            3440 non-null   float64
 13  hc           3440 non-null   float64
 14  cy           2919 non-null   object 
 15  ll           2919 non-null   object 
 16  _heartbeat_  120 non-null    float64
 17  kw           93 non-null     object 
dtypes: float64(4), object(14)
memory usage: 500.8+ KB
frame['tz'][:10]
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object
# For a Series, we can use the value_counts method
tz_counts = frame['tz'].value_counts()
tz_counts[:10]
America/New_York       1251
                        521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
America/Sao_Paulo        33
Name: tz, dtype: int64
# Replace missing values with the fillna method, and use boolean array indexing for the empty strings
clean_tz = frame['tz'].fillna('Missing')
clean_tz[:10]
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7                     
8                     
9                     
Name: tz, dtype: object
clean_tz[clean_tz ==''] = 'Unknown'
clean_tz[:10]
0     America/New_York
1       America/Denver
2     America/New_York
3    America/Sao_Paulo
4     America/New_York
5     America/New_York
6        Europe/Warsaw
7              Unknown
8              Unknown
9              Unknown
Name: tz, dtype: object
tz_counts = clean_tz.value_counts()
tz_counts[:10]
America/New_York       1251
Unknown                 521
America/Chicago         400
America/Los_Angeles     382
America/Denver          191
Missing                 120
Europe/London            74
Asia/Tokyo               37
Pacific/Honolulu         36
Europe/Madrid            35
Name: tz, dtype: int64
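The two cleaning steps handle different kinds of gaps: fillna covers records with no tz field at all (NaN), while the boolean mask covers records whose tz is an empty string. A small sketch on a made-up Series shows both; Series.mask is used here as a non-mutating alternative to the boolean assignment above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the tz column
tz = pd.Series(['America/New_York', '', np.nan, 'America/New_York', ''], dtype=object)

clean = tz.fillna('Missing')                 # NaN -> 'Missing'
clean = clean.mask(clean == '', 'Unknown')   # '' -> 'Unknown', without in-place assignment

print(clean.value_counts())
```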
import seaborn as sns
subset = tz_counts[:10]
sns.barplot(y=subset.index,x=subset.values)
# The a column contains information about the browser, device, or application used to perform the URL shortening
frame['a'][1]
'GoogleMaps/RochesterNY'
frame['a'][50]
'Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2'
frame['a'][51]
'Mozilla/5.0 (Linux; U; Android 2.2.2; en-us; LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1'
# Split the agent string on '; ' (with no argument, split() separates on whitespace)
frame['a'][51].split('; ')
['Mozilla/5.0 (Linux',
 'U',
 'Android 2.2.2',
 'en-us',
 'LG-P925/V10e Build/FRG83G) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1']
results = pd.Series([x.split()[0] for x in frame.a.dropna()])
results[:5]
0               Mozilla/5.0
1    GoogleMaps/RochesterNY
2               Mozilla/4.0
3               Mozilla/5.0
4               Mozilla/5.0
dtype: object
results.value_counts()[:8]
Mozilla/5.0                 2594
Mozilla/4.0                  601
GoogleMaps/RochesterNY       121
Opera/9.80                    34
TEST_INTERNET_AGENT           24
GoogleProducer                21
Mozilla/6.0                    5
BlackBerry8520/5.0.0.681       4
dtype: int64
frame.a.notnull()[-10:]
3550    True
3551    True
3552    True
3553    True
3554    True
3555    True
3556    True
3557    True
3558    True
3559    True
Name: a, dtype: bool
# Some agent strings are missing, so we exclude those rows from the data
cframe = frame[frame.a.notnull()]
cframe[:10]
acnktzgrghlalhhruthccyll_heartbeat_kw
0Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...US1.0America/New_YorkMAA6qOVHwfLQtforofrogen-US,en;q=0.81.usa.govhttp://www.facebook.com/l/7AQEFzjSi/1.usa.gov/...http://www.ncbi.nlm.nih.gov/pubmed/224159911.331923e+091.331823e+09Danvers[42.576698, -70.954903]NaNNaN
1GoogleMaps/RochesterNYUS0.0America/DenverUTmwszkSmwszkSbitlyNaNj.mphttp://www.AwareMap.com/http://www.monroecounty.gov/etc/911/rss.php1.331923e+091.308262e+09Provo[40.218102, -111.613297]NaNNaN
2Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...US1.0America/New_YorkDCxxr3Qbxxr3Qbbitlyen-US1.usa.govhttp://t.co/03elZC4Qhttp://boxer.senate.gov/en/press/releases/0316...1.331923e+091.331920e+09Washington[38.9007, -77.043098]NaNNaN
3Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...BR0.0America/Sao_Paulo27zCaLwpzUtuOualelex88pt-br1.usa.govdirecthttp://apod.nasa.gov/apod/ap120312.html1.331923e+091.331923e+09Braz[-23.549999, -46.616699]NaNNaN
4Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...US0.0America/New_YorkMA9b6kNl9b6kNlbitlyen-US,en;q=0.8bit.lyhttp://www.shrewsbury-ma.gov/selco/http://www.shrewsbury-ma.gov/egov/gallery/1341...1.331923e+091.273672e+09Shrewsbury[42.286499, -71.714699]NaNNaN
5Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...US0.0America/New_YorkMAaxNK8caxNK8cbitlyen-US,en;q=0.8bit.lyhttp://www.shrewsbury-ma.gov/selco/http://www.shrewsbury-ma.gov/egov/gallery/1341...1.331923e+091.273673e+09Shrewsbury[42.286499, -71.714699]NaNNaN
6Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1...PL0.0Europe/Warsaw77wcndERzkpJBRbnjacobspl-PL,pl;q=0.8,en-US;q=0.6,en;q=0.41.usa.govhttp://plus.url.google.com/url?sa=z&n=13319232...http://www.nasa.gov/mission_pages/nustar/main/...1.331923e+091.331923e+09Luban[51.116699, 15.2833]NaNNaN
7Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/2...None0.0NaNwcndERzkpJBRbnjacobsbg,en-us;q=0.7,en;q=0.31.usa.govhttp://www.facebook.com/http://www.nasa.gov/mission_pages/nustar/main/...1.331923e+091.331923e+09NaNNaNNaNNaN
8Opera/9.80 (X11; Linux zbov; U; en) Presto/2.1...None0.0NaNwcndERzkpJBRbnjacobsen-US, en1.usa.govhttp://www.facebook.com/l.php?u=http%3A%2F%2F1...http://www.nasa.gov/mission_pages/nustar/main/...1.331923e+091.331923e+09NaNNaNNaNNaN
9Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...None0.0NaNzCaLwpzUtuOualelex88pt-BR,pt;q=0.8,en-US;q=0.6,en;q=0.41.usa.govhttp://t.co/o1Pd0WeVhttp://apod.nasa.gov/apod/ap120312.html1.331923e+091.331923e+09NaNNaNNaNNaN
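# Next, compute a value indicating whether each row is Windows or not.
# Because cframe is a filtered slice of frame, this assignment triggers the
# SettingWithCopyWarning shown below; creating cframe with
# frame[frame.a.notnull()].copy() avoids it
cframe['os'] = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')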
<ipython-input-69-02329ab5f824>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cframe['os'] = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')
cframe['os'] [:10]
0        Windows
1    Not Windows
2        Windows
3    Not Windows
4        Windows
5        Windows
6        Windows
7        Windows
8    Not Windows
9        Windows
Name: os, dtype: object
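A warning-free variant of this step takes an explicit copy of the filtered rows before adding the new column. The sketch below uses a made-up four-row frame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one missing agent string
frame = pd.DataFrame({'a': ['Mozilla/5.0 (Windows NT 6.1; WOW64)',
                            'GoogleMaps/RochesterNY',
                            None,
                            'Opera/9.80 (X11; Linux zbov; U; en)']})

# .copy() makes cframe an independent DataFrame, so adding a column is unambiguous
cframe = frame[frame['a'].notnull()].copy()
cframe['os'] = np.where(cframe['a'].str.contains('Windows'), 'Windows', 'Not Windows')

print(cframe['os'].tolist())
```

The original frame is left untouched, which is usually the intent when working on a filtered subset.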
# The data can then be grouped by the time zone column and the newly created operating system column
by_tz_os = cframe.groupby(['tz','os'])
# Like value_counts, the group counts can be computed with size. The result can then be reshaped into a table with unstack
agg_counts = by_tz_os.size().unstack().fillna(0)
agg_counts[:10]
os                              Not Windows  Windows
tz
                                      245.0    276.0
Africa/Cairo                            0.0      3.0
Africa/Casablanca                       0.0      1.0
Africa/Ceuta                            0.0      2.0
Africa/Johannesburg                     0.0      1.0
Africa/Lusaka                           0.0      1.0
America/Anchorage                       4.0      1.0
America/Argentina/Buenos_Aires          1.0      0.0
America/Argentina/Cordoba               0.0      1.0
America/Argentina/Mendoza               0.0      1.0
# Finally, select the time zones with the highest overall counts.
# To do so, construct an indirect index array from the row totals in agg_counts
# that sorts them in ascending order
indexer = agg_counts.sum(1).argsort()
indexer[:10]
tz
                                  24
Africa/Cairo                      20
Africa/Casablanca                 21
Africa/Ceuta                      92
Africa/Johannesburg               87
Africa/Lusaka                     53
America/Anchorage                 54
America/Argentina/Buenos_Aires    57
America/Argentina/Cordoba         26
America/Argentina/Mendoza         55
dtype: int64
#使用take方法按顺序选出行,之后再对最后10行进行切片(最大的10个值)
count_subset = agg_counts.take(indexer[-10:])
count_subset
os                   Not Windows  Windows
tz
America/Sao_Paulo           13.0     20.0
Europe/Madrid               16.0     19.0
Pacific/Honolulu             0.0     36.0
Asia/Tokyo                   2.0     35.0
Europe/London               43.0     31.0
America/Denver             132.0     59.0
America/Los_Angeles        130.0    252.0
America/Chicago            115.0    285.0
                           245.0    276.0
America/New_York           339.0    912.0
# pandas has a convenient method called nlargest that does the same thing
agg_counts.sum(1).nlargest(10)
tz
America/New_York       1251.0
                        521.0
America/Chicago         400.0
America/Los_Angeles     382.0
America/Denver          191.0
Europe/London            74.0
Asia/Tokyo               37.0
Pacific/Honolulu         36.0
Europe/Madrid            35.0
America/Sao_Paulo        33.0
dtype: float64
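The two approaches above select the same rows but return them in opposite order. A small sketch on made-up totals; calling `to_numpy()` before `argsort` is a choice made here for clarity, not something the original code does.

```python
import pandas as pd

# Made-up totals per label
totals = pd.Series({'A': 5.0, 'B': 12.0, 'C': 1.0, 'D': 8.0})

# argsort gives the integer positions that sort the values in ascending order
order = totals.to_numpy().argsort()
# Taking the last two positions selects the two largest values, smallest first
top2_via_take = totals.take(order[-2:])

# nlargest returns the same rows, ordered from largest to smallest
top2_via_nlargest = totals.nlargest(2)

print(top2_via_take)
print(top2_via_nlargest)
```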
# Rearrange the data for plotting
count_subset = count_subset.stack()
count_subset
tz                   os         
America/Sao_Paulo    Not Windows     13.0
                     Windows         20.0
Europe/Madrid        Not Windows     16.0
                     Windows         19.0
Pacific/Honolulu     Not Windows      0.0
                     Windows         36.0
Asia/Tokyo           Not Windows      2.0
                     Windows         35.0
Europe/London        Not Windows     43.0
                     Windows         31.0
America/Denver       Not Windows    132.0
                     Windows         59.0
America/Los_Angeles  Not Windows    130.0
                     Windows        252.0
America/Chicago      Not Windows    115.0
                     Windows        285.0
                     Not Windows    245.0
                     Windows        276.0
America/New_York     Not Windows    339.0
                     Windows        912.0
dtype: float64
count_subset.name = 'total'
count_subset = count_subset.reset_index()
count_subset[:10]
                  tz           os  total
0  America/Sao_Paulo  Not Windows   13.0
1  America/Sao_Paulo      Windows   20.0
2      Europe/Madrid  Not Windows   16.0
3      Europe/Madrid      Windows   19.0
4   Pacific/Honolulu  Not Windows    0.0
5   Pacific/Honolulu      Windows   36.0
6         Asia/Tokyo  Not Windows    2.0
7         Asia/Tokyo      Windows   35.0
8      Europe/London  Not Windows   43.0
9      Europe/London      Windows   31.0
sns.barplot(x='total',y='tz',hue='os',data=count_subset)
# The plot above makes it hard to see the relative share of Windows users in the smaller groups, so normalize the group percentages to sum to 1
def norm_total(group):
    group['normed_total'] = group.total/group.total.sum()
    return group 

results = count_subset.groupby('tz').apply(norm_total)
results[:10]
                  tz           os  total  normed_total
0  America/Sao_Paulo  Not Windows   13.0      0.393939
1  America/Sao_Paulo      Windows   20.0      0.606061
2      Europe/Madrid  Not Windows   16.0      0.457143
3      Europe/Madrid      Windows   19.0      0.542857
4   Pacific/Honolulu  Not Windows    0.0      0.000000
5   Pacific/Honolulu      Windows   36.0      1.000000
6         Asia/Tokyo  Not Windows    2.0      0.054054
7         Asia/Tokyo      Windows   35.0      0.945946
8      Europe/London  Not Windows   43.0      0.581081
9      Europe/London      Windows   31.0      0.418919
sns.barplot(x='normed_total',y='tz',hue='os',data=results)
# The normalized sum can be computed more efficiently with groupby and transform
g = count_subset.groupby('tz')
results2 = count_subset.total/g.total.transform('sum')
results2[:10]
0    0.393939
1    0.606061
2    0.457143
3    0.542857
4    0.000000
5    1.000000
6    0.054054
7    0.945946
8    0.581081
9    0.418919
Name: total, dtype: float64
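The transform-based normalization can be checked on a toy frame. This is a sketch with invented numbers; it shows that `transform('sum')` returns one value per original row, so the division lines up without any reshaping.

```python
import pandas as pd

# Invented counts for two groups
toy = pd.DataFrame({
    'tz': ['X', 'X', 'Y', 'Y'],
    'os': ['Not Windows', 'Windows', 'Not Windows', 'Windows'],
    'total': [1.0, 3.0, 2.0, 2.0],
})

# transform('sum') returns a Series aligned with the original rows,
# holding the total of the group each row belongs to
group_sums = toy.groupby('tz')['total'].transform('sum')
toy['normed_total'] = toy['total'] / group_sums
print(toy)
```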

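14.2 MovieLens 1M Dataset

  • The data provide movie ratings, movie metadata (genres and year), and demographic data about the viewers (age, zip code, gender, and occupation).
  • The MovieLens 1M dataset contains 1 million ratings of 4,000 movies by 6,000 users. The data are spread across three tables: ratings, user information, and movie information.
  • After extracting the data from the ZIP file, each table can be loaded into a pandas DataFrame with pandas.read_table.
import pandas as pd
# Show fewer rows in the output
pd.options.display.max_rows = 10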
unames = ['user_id','gender','age','occupation','zip']
users = pd.read_table('datasets/movielens/users.dat',sep='::',header=None,names=unames)
users[:5]
<ipython-input-101-ffe8596a8cfd>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  users = pd.read_table('datasets/movielens/users.dat',sep='::',header=None,names=unames)
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
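The ParserWarning above says it can be avoided by specifying engine='python'. The sketch below does that on an in-memory string instead of the real file; the two rows are copied from the output above, and the name `users_demo` is made up for this example.

```python
import io
import pandas as pd

# Two rows in the same '::'-delimited layout as users.dat
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# engine='python' is given explicitly, as the warning message recommends
users_demo = pd.read_table(io.StringIO(raw), sep='::', header=None,
                           names=unames, engine='python')
print(users_demo)
```

The same `engine='python'` argument can be added to the read_table calls for the ratings and movies tables.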
rnames = ['user_id','movie_id','rating','timestamp']
ratings = pd.read_table('datasets/movielens/ratings.dat',sep='::',header=None,names=rnames)
ratings[:5]
<ipython-input-103-bafd8ea1cf17>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  ratings = pd.read_table('datasets/movielens/ratings.dat',sep='::',header=None,names=rnames)
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
mnames = ['movie_id','title','genres']
movies = pd.read_table('datasets/movielens/movies.dat',sep='::',header=None,names=mnames)
movies[:5]
<ipython-input-118-35e3f9b1d007>:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  movies = pd.read_table('datasets/movielens/movies.dat',sep='::',header=None,names=mnames)
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
# Using pandas's merge function, first merge ratings with users, then merge that result with movies.
# pandas infers which columns to use as the merge (join) keys from the overlapping column names
data = pd.merge(pd.merge(ratings,users),movies)
data[:5]
   user_id  movie_id  rating  timestamp gender  age  occupation    zip                                   title genres
0        1      1193       5  978300760      F    1          10  48067  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      1193       5  978298413      M   56          16  70072  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      1193       4  978220179      M   25          12  32793  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      1193       4  978199279      M   25           7  22903  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      1193       5  978158471      M   50           1  95350  One Flew Over the Cuckoo's Nest (1975)  Drama
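Key inference is convenient, but the join keys can also be named explicitly. The sketch below uses three tiny made-up tables (the `*_demo` names and their contents are hypothetical) and shows that both forms give the same result when the only shared columns are the intended keys.

```python
import pandas as pd

# Tiny made-up tables that mimic the shape of ratings, users and movies
ratings_demo = pd.DataFrame({'user_id': [1, 2], 'movie_id': [10, 10], 'rating': [5, 4]})
users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies_demo = pd.DataFrame({'movie_id': [10], 'title': ['Example Movie (2000)']})

# Naming the keys with on= documents the join and guards against
# accidental matches on other column names the tables might share
merged = (ratings_demo
          .merge(users_demo, on='user_id')
          .merge(movies_demo, on='movie_id'))

# With no on= argument, pandas joins on the overlapping column names
inferred = pd.merge(pd.merge(ratings_demo, users_demo), movies_demo)
print(merged)
```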
data.iloc[0]
user_id                                            1
movie_id                                        1193
rating                                             5
timestamp                                  978300760
gender                                             F
age                                                1
occupation                                        10
zip                                            48067
title         One Flew Over the Cuckoo's Nest (1975)
genres                                         Drama
Name: 0, dtype: object
# To get the mean rating for each movie grouped by gender, use the pivot_table method
mean_ratings = data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
mean_ratings
gender                                             F         M
title
$1,000,000 Duck (1971)                      3.375000  2.761905
'Night Mother (1986)                        3.388889  3.352941
'Til There Was You (1997)                   2.675676  2.733333
'burbs, The (1989)                          2.793478  2.962085
...And Justice for All (1979)               3.828571  3.689024
...                                              ...       ...
Zed & Two Noughts, A (1985)                 3.500000  3.380952
Zero Effect (1998)                          3.864407  3.723140
Zero Kelvin (Kjærlighetens kjøtere) (1995)       NaN  3.500000
Zeus and Roxanne (1997)                     2.777778  2.357143
eXistenZ (1999)                             3.098592  3.289086

3706 rows × 2 columns

mean_ratings[:5]
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
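A compact sketch of what pivot_table does here, using invented ratings. It also shows where a NaN comes from: a title that has no ratings from one gender, as with the Zero Kelvin row in the full table.

```python
import pandas as pd

# Invented ratings; 'Film B' has no ratings from female viewers
toy = pd.DataFrame({
    'title': ['Film A', 'Film A', 'Film A', 'Film B'],
    'gender': ['F', 'M', 'M', 'M'],
    'rating': [4, 2, 4, 5],
})

# One row per title, one column per gender, each cell the mean rating
means = toy.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
print(means)
```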
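# First, filter out movies that received fewer than 250 ratings (a completely arbitrary number);
# to do this, group the data by title and use size() to get a Series of group sizes for each title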
ratings_by_title = data.groupby('title').size()
ratings_by_title[:10]
title
$1,000,000 Duck (1971)                37
'Night Mother (1986)                  70
'Til There Was You (1997)             52
'burbs, The (1989)                   303
...And Justice for All (1979)        199
1-900 (1994)                           2
10 Things I Hate About You (1999)    700
101 Dalmatians (1961)                565
101 Dalmatians (1996)                364
12 Angry Men (1957)                  616
dtype: int64
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles
Index([''burbs, The (1989)', '10 Things I Hate About You (1999)',
       '101 Dalmatians (1961)', '101 Dalmatians (1996)', '12 Angry Men (1957)',
       '13th Warrior, The (1999)', '2 Days in the Valley (1996)',
       '20,000 Leagues Under the Sea (1954)', '2001: A Space Odyssey (1968)',
       '2010 (1984)',
       ...
       'X-Men (2000)', 'Year of Living Dangerously (1982)',
       'Yellow Submarine (1968)', 'You've Got Mail (1998)',
       'Young Frankenstein (1974)', 'Young Guns (1988)',
       'Young Guns II (1990)', 'Young Sherlock Holmes (1985)',
       'Zero Effect (1998)', 'eXistenZ (1999)'],
      dtype='object', name='title', length=1216)
# The index of titles with at least 250 ratings can then be used to select the matching rows from mean_ratings
mean_ratings = mean_ratings.loc[active_titles]
mean_ratings
gender                                    F         M
title
'burbs, The (1989)                 2.793478  2.962085
10 Things I Hate About You (1999)  3.646552  3.311966
101 Dalmatians (1961)              3.791444  3.500000
101 Dalmatians (1996)              3.240000  2.911215
12 Angry Men (1957)                4.184397  4.328421
...                                     ...       ...
Young Guns (1988)                  3.371795  3.425620
Young Guns II (1990)               2.934783  2.904025
Young Sherlock Holmes (1985)       3.514706  3.363344
Zero Effect (1998)                 3.864407  3.723140
eXistenZ (1999)                    3.098592  3.289086

1216 rows × 2 columns
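The filtering step can be sketched on a toy frame as follows. The threshold `min_ratings = 3` is a hypothetical stand-in for the 250 used above, and the ratings are invented.

```python
import pandas as pd

# Invented ratings: 'Film A' has three ratings, 'Film B' only one
toy = pd.DataFrame({
    'title': ['Film A'] * 3 + ['Film B'],
    'rating': [4, 2, 4, 5],
})
min_ratings = 3  # hypothetical threshold; the article uses 250

# Number of ratings per title
counts = toy.groupby('title').size()
# Index of titles that meet the threshold
active = counts.index[counts >= min_ratings]
# Keep only those titles in the per-title means
mean_by_title = toy.groupby('title')['rating'].mean().loc[active]
print(mean_by_title)
```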

# To see the top movies among female viewers, sort by the F column in descending order
top_female_ratings = mean_ratings.sort_values(by='F',ascending=False)
top_female_ratings[:10]
gender                                                         F         M
title
Close Shave, A (1995)                                   4.644444  4.473795
Wrong Trousers, The (1993)                              4.588235  4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)           4.572650  4.464589
Wallace & Gromit: The Best of Aardman Animation (1996)  4.563107  4.385075
Schindler's List (1993)                                 4.562602  4.491415
Shawshank Redemption, The (1994)                        4.539075  4.560625
Grand Day Out, A (1992)                                 4.537879  4.293255
To Kill a Mockingbird (1962)                            4.536667  4.372611
Creature Comforts (1990)                                4.513889  4.272277
Usual Suspects, The (1995)                              4.513317  4.518248

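14.2.1 Measuring Rating Disagreement

  • Suppose you want to find the movies that are most divisive between male and female viewers.
# One way is to add a column to mean_ratings containing the difference in means, then sort by that column
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']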
mean_ratings[:10]
gender                                      F         M       diff
title
'burbs, The (1989)                   2.793478  2.962085   0.168607
10 Things I Hate About You (1999)    3.646552  3.311966  -0.334586
101 Dalmatians (1961)                3.791444  3.500000  -0.291444
101 Dalmatians (1996)                3.240000  2.911215  -0.328785
12 Angry Men (1957)                  4.184397  4.328421   0.144024
13th Warrior, The (1999)             3.112000  3.168000   0.056000
2 Days in the Valley (1996)          3.488889  3.244813  -0.244076
20,000 Leagues Under the Sea (1954)  3.670103  3.709205   0.039102
2001: A Space Odyssey (1968)         3.825581  4.129738   0.304156
2010 (1984)                          3.446809  3.413712  -0.033097
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
gender                                        F         M       diff
title
Dirty Dancing (1987)                   3.790378  2.959596  -0.830782
Jumpin' Jack Flash (1986)              3.254717  2.578358  -0.676359
Grease (1978)                          3.975265  3.367041  -0.608224
Little Women (1994)                    3.870588  3.321739  -0.548849
Steel Magnolias (1989)                 3.901734  3.365957  -0.535777
Anastasia (1997)                       3.800000  3.281609  -0.518391
Rocky Horror Picture Show, The (1975)  3.673016  3.160131  -0.512885
Color Purple, The (1985)               4.158192  3.659341  -0.498851
Age of Innocence, The (1993)           3.827068  3.339506  -0.487561
Free Willy (1993)                      2.921348  2.438776  -0.482573
# Reversing the order of the rows and slicing off the top 10 gives the movies that men preferred but women did not rate as highly
sorted_by_diff[::-1][:10]
gender                                         F         M      diff
title
Good, The Bad and The Ugly, The (1966)  3.494949  4.221300  0.726351
Kentucky Fried Movie, The (1977)        2.878788  3.555147  0.676359
Dumb & Dumber (1994)                    2.697987  3.336595  0.638608
Longest Day, The (1962)                 3.411765  4.031447  0.619682
Cable Guy, The (1996)                   2.250000  2.863787  0.613787
Evil Dead II (Dead By Dawn) (1987)      3.297297  3.909283  0.611985
Hidden, The (1987)                      3.137931  3.745098  0.607167
Rocky III (1982)                        2.361702  2.943503  0.581801
Caddyshack (1980)                       3.396135  3.969737  0.573602
For a Few Dollars More (1965)           3.409091  3.953795  0.544704
# Suppose instead you want the movies that elicited the most disagreement among viewers, independent of gender identification.
# Disagreement can be measured by the variance or standard deviation of the ratings
ratings_std_by_title = data.groupby('title')['rating'].std()
ratings_std_by_title[:10]
title
$1,000,000 Duck (1971)               1.092563
'Night Mother (1986)                 1.118636
'Til There Was You (1997)            1.020159
'burbs, The (1989)                   1.107760
...And Justice for All (1979)        0.878110
1-900 (1994)                         0.707107
10 Things I Hate About You (1999)    0.989815
101 Dalmatians (1961)                0.982103
101 Dalmatians (1996)                1.098717
12 Angry Men (1957)                  0.812731
Name: rating, dtype: float64
ratings_std_by_title = ratings_std_by_title.loc[active_titles]
ratings_std_by_title[:10]
title
'burbs, The (1989)                     1.107760
10 Things I Hate About You (1999)      0.989815
101 Dalmatians (1961)                  0.982103
101 Dalmatians (1996)                  1.098717
12 Angry Men (1957)                    0.812731
13th Warrior, The (1999)               1.140421
2 Days in the Valley (1996)            0.921592
20,000 Leagues Under the Sea (1954)    0.869685
2001: A Space Odyssey (1968)           1.042504
2010 (1984)                            0.946618
Name: rating, dtype: float64
ratings_std_by_title.sort_values(ascending=False)[:10]
title
Dumb & Dumber (1994)                     1.321333
Blair Witch Project, The (1999)          1.316368
Natural Born Killers (1994)              1.307198
Tank Girl (1995)                         1.277695
Rocky Horror Picture Show, The (1975)    1.260177
Eyes Wide Shut (1999)                    1.259624
Evita (1996)                             1.253631
Billy Madison (1995)                     1.249970
Fear and Loathing in Las Vegas (1998)    1.246408
Bicentennial Man (1999)                  1.245533
Name: rating, dtype: float64
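As a rough illustration of why the standard deviation captures disagreement, the sketch below compares a made-up polarizing film with a made-up consensus film. Note that pandas computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# 'Film A' splits viewers between 1 and 5; 'Film B' gets similar scores from everyone
toy = pd.DataFrame({
    'title': ['Film A'] * 4 + ['Film B'] * 4,
    'rating': [1, 5, 1, 5, 3, 3, 4, 3],
})

# Sample standard deviation of the ratings for each title
rating_std = toy.groupby('title')['rating'].std()
# Titles with the most disagreement come first
most_divisive = rating_std.sort_values(ascending=False)
print(most_divisive)
```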

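14.3 US Baby Names 1880–2010

  • The United States Social Security Administration (SSA) has made available data on the frequency of baby names from 1880 through the present.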
names1880 = pd.read_table('datasets/babynames/yob1880.txt',sep=',',names=['name','sex','births'])
names1880.head()
        name sex  births
0       Mary   F    7065
1       Anna   F    2604
2       Emma   F    2003
3  Elizabeth   F    1939
4     Minnie   F    1746
# For simplicity, use the sum of the births column by sex as the total number of births in that year
names1880.groupby('sex').births.sum()
sex
F     90993
M    110493
Name: births, dtype: int64
# Since the dataset is split into one file per year, one of the first things to do is to assemble all of the data into a single DataFrame and add a year field.
# You can do this using pandas.concat
years = range(1880,2011)
pieces = []
columns = ['name','sex','births']
for year in years:
    path = 'datasets/babynames/yob%d.txt' %year
    frame = pd.read_csv(path,names=columns)
    frame['year'] = year
    pieces.append(frame)
# Concatenate everything into a single DataFrame
names = pd.concat(pieces,ignore_index=True)
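The loop above needs the yearly files on disk. The sketch below reproduces the same pattern with in-memory text so that it runs anywhere; the names and counts in `raw_by_year` are invented placeholders, not SSA data.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two yearly files (invented names and counts)
raw_by_year = {
    1880: "Ann,F,10\nBob,M,20\n",
    1881: "Ann,F,30\nBob,M,40\n",
}
columns = ['name', 'sex', 'births']

pieces = []
for year, text in raw_by_year.items():
    frame = pd.read_csv(io.StringIO(text), names=columns)
    # Tag every row with the year of the file it came from
    frame['year'] = year
    pieces.append(frame)

# ignore_index=True discards the per-file row numbers and builds a fresh 0..n-1 index
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the row labels 0 and 1 would repeat once per file.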
names.head()
        name sex  births  year
0       Mary   F    7065  1880
1       Anna   F    2604  1880
2       Emma   F    2003  1880
3  Elizabeth   F    1939  1880
4     Minnie   F    1746  1880
# Start aggregating the data at the year and sex level using groupby or pivot_table
total_births = names.pivot_table('births',index='year',columns='sex',aggfunc=sum)
total_births.tail()
sex         F        M
year
2006  1896468  2050234
2007  1916888  2069242
2008  1883645  2032310
2009  1827643  1973359
2010  1759010  1898382
total_births.plot(title='Total births by sex and year')
def add_prop(group):
    group['prop'] = group.births/group.births.sum()
    return group

names = names.groupby(['year','sex']).apply(add_prop)
names
              name sex  births  year     group      prop
0             Mary   F    7065  1880  0.077643  0.077643
1             Anna   F    2604  1880  0.028618  0.028618
2             Emma   F    2003  1880  0.022013  0.022013
3        Elizabeth   F    1939  1880  0.021309  0.021309
4           Minnie   F    1746  1880  0.019188  0.019188
...            ...  ..     ...   ...       ...       ...
1690779    Zymaire   M       5  2010  0.000003  0.000003
1690780     Zyonne   M       5  2010  0.000003  0.000003
1690781  Zyquarius   M       5  2010  0.000003  0.000003
1690782      Zyran   M       5  2010  0.000003  0.000003
1690783      Zzyzx   M       5  2010  0.000003  0.000003

1690784 rows × 6 columns

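# When performing a group operation like this, it is often valuable to do a sanity check, such as verifying that the prop column sums to 1 within every group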
names.groupby(['year','sex']).prop.sum()
year  sex
1880  F      1.0
      M      1.0
1881  F      1.0
      M      1.0
1882  F      1.0
            ... 
2008  M      1.0
2009  F      1.0
      M      1.0
2010  F      1.0
      M      1.0
Name: prop, Length: 262, dtype: float64
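Reading 262 printed values by eye does not scale, and floating-point sums can land slightly off 1. One possible way to automate the check is `np.allclose`. The sketch below uses invented data and computes `prop` with `transform`, which is a substitution for the apply-based `add_prop` above.

```python
import numpy as np
import pandas as pd

# Invented births for three (year, sex) groups
toy = pd.DataFrame({
    'year': [1880, 1880, 1880, 1881],
    'sex': ['F', 'F', 'M', 'F'],
    'births': [3, 7, 5, 4],
})

# Share of each name within its (year, sex) group
group_totals = toy.groupby(['year', 'sex'])['births'].transform('sum')
toy['prop'] = toy['births'] / group_totals

# Every group should sum to 1, up to floating-point tolerance
prop_sums = toy.groupby(['year', 'sex'])['prop'].sum()
all_close_to_one = np.allclose(prop_sums.to_numpy(), 1)
print(all_close_to_one)
```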
# Extract the top 1,000 names for each sex/year combination
def get_top1000(group):
    return group.sort_values(by='births',ascending=False)[:1000]

grouped = names.groupby(['year','sex'])
top1000 = grouped.apply(get_top1000)
# Drop the group index; it is not needed
top1000.reset_index(inplace=True,drop=True)
# If you prefer a do-it-yourself approach, try this instead
pieces = []
for year,group in names.groupby(['year','sex']):
    pieces.append(group.sort_values(by='births',ascending=False)[:1000])
# Concatenate once, after the loop has collected every group
top1000 = pd.concat(pieces,ignore_index=True)
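A third option, not used in the original, is to sort once and let `groupby(...).head(n)` keep the first n rows of each group. The sketch below uses invented data and a hypothetical `n = 2` in place of 1,000.

```python
import pandas as pd

# Invented names and counts for a single year
toy = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd', 'e'],
    'sex': ['F', 'F', 'F', 'M', 'M'],
    'year': [1880] * 5,
    'births': [5, 9, 7, 3, 8],
})
n = 2  # hypothetical; the article keeps 1,000 per group

top_n = (toy.sort_values(by='births', ascending=False)
            # head(n) keeps the first n rows of each group in the sorted order
            .groupby(['year', 'sex'])
            .head(n)
            # restore a year/sex layout with the most popular names first
            .sort_values(by=['year', 'sex', 'births'], ascending=[True, True, False])
            .reset_index(drop=True))
print(top_n)
```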
top1000
             name sex  births  year     group      prop
0            Mary   F    7065  1880  0.077643  0.077643
1            Anna   F    2604  1880  0.028618  0.028618
2            Emma   F    2003  1880  0.022013  0.022013
3       Elizabeth   F    1939  1880  0.021309  0.021309
4          Minnie   F    1746  1880  0.019188  0.019188
...           ...  ..     ...   ...       ...       ...
261872     Camilo   M     194  2010  0.000102  0.000102
261873     Destin   M     194  2010  0.000102  0.000102
261874     Jaquan   M     194  2010  0.000102  0.000102
261875     Jaydan   M     194  2010  0.000102  0.000102
261876     Maxton   M     193  2010  0.000102  0.000102

261877 rows × 6 columns

14.3.1 Analyzing Naming Trends

boys = top1000[top1000.sex=='M']
girls = top1000[top1000.sex=='F']
total_births = top1000.pivot_table('births',index='year',columns='name',aggfunc=sum)
total_births
name   Aaden  Aaliyah  Aarav   Aaron  ...  Zona  Zora  Zula   Zuri
year
1880     NaN      NaN    NaN   102.0  ...   8.0  28.0  27.0    NaN
1881     NaN      NaN    NaN    94.0  ...   9.0  21.0  27.0    NaN
1882     NaN      NaN    NaN    85.0  ...  17.0  32.0  21.0    NaN
1883     NaN      NaN    NaN   105.0  ...  11.0  35.0  25.0    NaN
1884     NaN      NaN    NaN    97.0  ...   8.0  58.0  27.0    NaN
...      ...      ...    ...     ...  ...   ...   ...   ...    ...
2006     NaN   3737.0    NaN  8279.0  ...   NaN   NaN   NaN    NaN
2007     NaN   3941.0    NaN  8914.0  ...   NaN   NaN   NaN    NaN
2008   955.0   4028.0  219.0  8511.0  ...   NaN   NaN   NaN    NaN
2009  1265.0   4352.0  270.0  7936.0  ...   NaN   NaN   NaN    NaN
2010   448.0   4628.0  438.0  7374.0  ...   NaN   NaN   NaN  258.0

131 rows × 6868 columns

total_births.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 1880 to 2010
Columns: 6868 entries, Aaden to Zuri
dtypes: float64(6868)
memory usage: 6.9 MB
subset = total_births[['John','Harry','Mary','Marilyn']]
subset.plot(subplots=True,figsize=(12,10),grid=False,title='Number of births per year')
14.3.1.1 Measuring the Increase in Naming Diversity
table = top1000.pivot_table('prop',index='year',columns='sex',aggfunc=sum)
table.plot(title='Sum of table1000.prop by year and sex',
          yticks=np.linspace(0,1.2,13),
          xticks=range(1880,2020,10))
# Another interesting metric is the number of distinct names, taken in order of popularity from highest to lowest, that make up the top 50% of births
df = boys[boys.year==2010]
df
           name sex  births  year     group      prop
260877    Jacob   M   21875  2010  0.011523  0.011523
260878    Ethan   M   17866  2010  0.009411  0.009411
260879  Michael   M   17133  2010  0.009025  0.009025
260880   Jayden   M   17030  2010  0.008971  0.008971
260881  William   M   16870  2010  0.008887  0.008887
...         ...  ..     ...   ...       ...       ...
261872   Camilo   M     194  2010  0.000102  0.000102
261873   Destin   M     194  2010  0.000102  0.000102
261874   Jaquan   M     194  2010  0.000102  0.000102
261875   Jaydan   M     194  2010  0.000102  0.000102
261876   Maxton   M     193  2010  0.000102  0.000102

1000 rows × 6 columns

prop_cumsum = df.sort_values(by='prop',ascending=False).prop.cumsum()
prop_cumsum[:10]
260877    0.011523
260878    0.020934
260879    0.029959
260880    0.038930
260881    0.047817
260882    0.056579
260883    0.065155
260884    0.073414
260885    0.081528
260886    0.089621
Name: prop, dtype: float64
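# searchsorted returns the zero-based position at which 0.5 would have to be inserted to keep the array sorted,
# so the number of names needed is that position plus 1
prop_cumsum.values.searchsorted(0.5)
116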
df = boys[boys.year==1900]
df
           name sex  births  year     group      prop
40877      John   M    9834  1900  0.065319  0.065319
40878   William   M    8580  1900  0.056990  0.056990
40879     James   M    7246  1900  0.048129  0.048129
40880    George   M    5405  1900  0.035901  0.035901
40881   Charles   M    4102  1900  0.027246  0.027246
...         ...  ..     ...   ...       ...       ...
41872    Theron   M       8  1900  0.000053  0.000053
41873   Terrell   M       8  1900  0.000053  0.000053
41874     Solon   M       8  1900  0.000053  0.000053
41875  Rayfield   M       8  1900  0.000053  0.000053
41876  Sinclair   M       8  1900  0.000053  0.000053

1000 rows × 6 columns

in1900 = df.sort_values(by='prop',ascending=False).prop.cumsum()
in1900.values.searchsorted(0.5)+1
25
# You can now apply this operation to each year/sex combination: group by those fields
# and apply a function that returns the count for each group
def get_quantile_count(group,q=0.5):
    group = group.sort_values(by='prop',ascending=False)
    return group.prop.cumsum().values.searchsorted(q)+1

diversity = top1000.groupby(['year','sex']).apply(get_quantile_count)
diversity = diversity.unstack('sex')
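The cumsum/searchsorted step in get_quantile_count can be traced by hand on a small array. The proportions below are invented; the point is that the returned position is zero-based, so the count of names is the position plus 1.

```python
import numpy as np

# Made-up proportions, already sorted from most to least popular
props = np.array([0.30, 0.15, 0.10, 0.05])

# Running total: roughly 0.30, 0.45, 0.55, 0.60
cumulative = props.cumsum()

# First position whose running total reaches 0.5 (zero-based)
position = cumulative.searchsorted(0.5)

# Convert the zero-based position into a count of names
names_needed = position + 1
print(position, names_needed)
```

Here the running total first reaches 0.5 at the third name, so three names cover half of the births.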
diversity.head()
sex    F   M
year
1880  38  14
1881  38  14
1882  38  15
1883  39  15
1884  39  16
# Import matplotlib
import matplotlib.pyplot as plt

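# Font settings so that Chinese text and the minus sign render correctly (only needed when plot labels contain Chinese characters)
plt.rcParams['font.sans-serif'] =['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
diversity.plot(title='Diversity metric by year')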
14.3.1.2 The "Last Letter" Revolution
# Extract the last letter from the name column
get_last_letter = lambda x:x[-1]
last_letters = names.name.map(get_last_letter)
last_letters.name = 'last_letter'
table = names.pivot_table('births',index=last_letters,columns=['sex','year'],aggfunc=sum)
table[:5]
sex                F                    ...         M
year            1880     1881     1882  ...      2008      2009      2010
last_letter
a            31446.0  31581.0  36536.0  ...   32901.0   31430.0   28438.0
b                NaN      NaN      NaN  ...   39945.0   38862.0   38859.0
c                NaN      NaN      5.0  ...   25318.0   24048.0   23125.0
d              609.0    607.0    734.0  ...   47910.0   46172.0   44398.0
e            33378.0  34080.0  40399.0  ...  140966.0  135496.0  129012.0

5 rows × 262 columns
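The same last-letter table can be built on a toy frame. In this sketch the names and counts are invented, and `.str[-1]` is used as a vectorized alternative to mapping the lambda shown above.

```python
import pandas as pd

# Invented names and counts for one year
toy = pd.DataFrame({
    'name': ['Anna', 'Emma', 'John', 'Ethan'],
    'sex': ['F', 'F', 'M', 'M'],
    'births': [10, 20, 30, 40],
    'year': [1880, 1880, 1880, 1880],
})

# .str[-1] takes the last character of every name in one vectorized step
last_letters = toy['name'].str[-1]
last_letters.name = 'last_letter'

# Rows: last letter; columns: (sex, year); cells: total births
by_letter = toy.pivot_table('births', index=last_letters,
                            columns=['sex', 'year'], aggfunc='sum')
print(by_letter)
```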

subtable = table.reindex(columns=[1910,1960,2010],level='year')
subtable.head()
sex                 F                             M
year             1910      1960      2010      1910      1960      2010
last_letter
a            108376.0  691247.0  670605.0     977.0    5204.0   28438.0
b                 NaN     694.0     450.0     411.0    3912.0   38859.0
c                 5.0      49.0     946.0     482.0   15476.0   23125.0
d              6750.0    3729.0    2607.0   22111.0  262112.0   44398.0
e            133569.0  435013.0  313833.0   28655.0  178823.0  129012.0
# Normalize the table by total births to compute a new table containing, for each sex, the proportion of total births ending in each letter
subtable.sum()
sex  year
F    1910     396416.0
     1960    2022062.0
     2010    1759010.0
M    1910     194198.0
     1960    2132588.0
     2010    1898382.0
dtype: float64
letter_prop = subtable/subtable.sum()
letter_prop[:10]
sex                 F                             M
year             1910      1960      2010      1910      1960      2010
last_letter
a            0.273390  0.341853  0.381240  0.005031  0.002440  0.014980
b                 NaN  0.000343  0.000256  0.002116  0.001834  0.020470
c            0.000013  0.000024  0.000538  0.002482  0.007257  0.012181
d            0.017028  0.001844  0.001482  0.113858  0.122908  0.023387
e            0.336941  0.215133  0.178415  0.147556  0.083853  0.067959
f                 NaN  0.000010  0.000055  0.000783  0.004325  0.001188
g            0.000144  0.000157  0.000374  0.002250  0.009488  0.001404
h            0.051529  0.036224  0.075852  0.045562  0.037907  0.051670
i            0.001526  0.039965  0.031734  0.000844  0.000603  0.022628
j                 NaN       NaN  0.000090       NaN       NaN  0.000769
#现在根据掌握的字母比例,我们可以绘制出按年划分的每个性别的条形图
import matplotlib.pyplot as plt
fig,axes = plt.subplots(2,1,figsize=(10,8))
letter_prop['M'].plot(kind='bar',rot=0,ax=axes[0],title='Male')
letter_prop['F'].plot(kind='bar',rot=0,ax=axes[1],title='Female')
letter_prop = table/table.sum()
dny_ts = letter_prop.loc[['d','n','y'],'M'].T
dny_ts.head()
last_letter         d         n         y
year
1880         0.083055  0.153213  0.075760
1881         0.083247  0.153214  0.077451
1882         0.085340  0.149560  0.077537
1883         0.084066  0.151646  0.079144
1884         0.086120  0.149915  0.080405
dny_ts.plot(title='Proportion of boys born with names ending in d/n/y over time')
14.3.1.3 Boy Names That Became Girl Names (and Vice Versa)
all_names = pd.Series(top1000.name.unique())
all_names[:10]
0         Mary
1         Anna
2         Emma
3    Elizabeth
4       Minnie
5     Margaret
6          Ida
7        Alice
8       Bertha
9        Sarah
dtype: object
lesley_like = all_names[all_names.str.lower().str.contains('lesl')]
lesley_like
632     Leslie
2294    Lesley
4262    Leslee
4728     Lesli
6103     Lesly
dtype: object
filtered = top1000[top1000.name.isin(lesley_like)]
filtered.groupby('name').births.sum()
name
Leslee      1082
Lesley     35022
Lesli        929
Leslie    370429
Lesly      10067
Name: births, dtype: int64
# Aggregate by sex and year, then normalize within each year
table = filtered.pivot_table('births',index='year',columns='sex',aggfunc='sum')
table = table.div(table.sum(1),axis=0)
table.tail()
sex     F   M
year
2006  1.0 NaN
2007  1.0 NaN
2008  1.0 NaN
2009  1.0 NaN
2010  1.0 NaN
table.plot(style={'M':'k-','F':'k--'})
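The within-year normalization used above relies on `div` with `axis=0`. A minimal sketch on made-up counts (the `counts` frame and its numbers are illustrative assumptions, not values from the baby names data) shows each row being divided by its own total:

```python
import pandas as pd

# Hypothetical births per year split by sex (made-up numbers)
counts = pd.DataFrame(
    {"F": [30, 10, 5], "M": [10, 10, 15]},
    index=pd.Index([2000, 2001, 2002], name="year"),
)

# One total per year (row)
row_totals = counts.sum(axis=1)

# axis=0 aligns row_totals with the row index, so every row is divided by its own total
shares = counts.div(row_totals, axis=0)
print(shares)
```

After this step every row of `shares` sums to 1, which is what makes the F and M curves comparable across years.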

14.4 USDA Food Database

import json
db = json.load(open('datasets/usda_food/database.json'))
len(db)
6636
# Each entry in db is a dict containing all the data for a single food.
# The 'nutrients' field is a list of dicts, one for each nutrient
db[0].keys()
dict_keys(['id', 'description', 'tags', 'manufacturer', 'group', 'portions', 'nutrients'])
db[0]['nutrients'][0]
{'value': 25.18,
 'units': 'g',
 'description': 'Protein',
 'group': 'Composition'}
nutrients = pd.DataFrame(db[0]['nutrients'])
nutrients[:7]
     value units                  description        group
0    25.18     g                      Protein  Composition
1    29.20     g            Total lipid (fat)  Composition
2     3.06     g  Carbohydrate, by difference  Composition
3     3.28     g                          Ash        Other
4   376.00  kcal                       Energy       Energy
5    39.28     g                        Water  Composition
6  1573.00    kJ                       Energy       Energy
# When converting a list of dicts to a DataFrame, we can specify a list of fields to extract.
# Here we extract the food name, group, ID, and manufacturer
info_keys = ['description','group','id','manufacturer']
info = pd.DataFrame(db,columns=info_keys)
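To see what the `columns` argument does when the input is a list of dicts, here is a small sketch with two made-up records (the `records` list and the name `info_demo` are assumptions for illustration only). Keys that are not listed are dropped, and a listed key missing from a record shows up as a missing value:

```python
import pandas as pd

# Made-up records standing in for entries of the food database
records = [
    {"id": 1, "description": "Cheese, cheddar", "group": "Dairy", "tags": []},
    {"id": 2, "description": "Bread, rye", "group": "Baked", "manufacturer": "Acme"},
]

# Only the listed keys become columns, in the listed order
info_keys = ["description", "group", "id", "manufacturer"]
info_demo = pd.DataFrame(records, columns=info_keys)
print(info_demo)
```

This is consistent with the `info.info()` output, where `manufacturer` has fewer non-null entries than the other columns.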
info[-10:]
                                            description                              group     id       manufacturer
6626  CAMPBELL Soup Company, V8 Vegetable Juice, Ess...  Vegetables and Vegetable Products  31010  Campbell Soup Co.
6627  CAMPBELL Soup Company, V8 Vegetable Juice, Spi...  Vegetables and Vegetable Products  31013  Campbell Soup Co.
6628  CAMPBELL Soup Company, PACE, Jalapenos Nacho S...  Vegetables and Vegetable Products  31014  Campbell Soup Co.
6629  CAMPBELL Soup Company, V8 60% Vegetable Juice,...  Vegetables and Vegetable Products  31016  Campbell Soup Co.
6630  CAMPBELL Soup Company, V8 Vegetable Juice, Low...  Vegetables and Vegetable Products  31017  Campbell Soup Co.
6631                             Bologna, beef, low fat        Sausages and Luncheon Meats  42161
6632  Turkey and pork sausage, fresh, bulk, patty or...        Sausages and Luncheon Meats  42173
6633                              Babyfood, juice, pear                         Baby Foods  43408               None
6634         Babyfood, dessert, banana yogurt, strained                         Baby Foods  43539               None
6635              Babyfood, banana no tapioca, strained                         Baby Foods  43546               None
info.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6636 entries, 0 to 6635
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   description   6636 non-null   object
 1   group         6636 non-null   object
 2   id            6636 non-null   int64 
 3   manufacturer  5195 non-null   object
dtypes: int64(1), object(3)
memory usage: 207.5+ KB
# value_counts shows the distribution of food groups
info.group.value_counts()[:10]
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Fruits and Fruit Juices              328
Pork Products                        328
Name: group, dtype: int64
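
14.5 2012 Federal Election Commission Database

  • The US Federal Election Commission publishes data on contributions to political campaigns. The data includes contributor names, occupation and employer, address, and contribution amount.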
fec = pd.read_csv('datasets/fec/P00000001-ALL.csv')
fec.info()
D:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:3165: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1001731 entries, 0 to 1001730
Data columns (total 16 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   cmte_id            1001731 non-null  object 
 1   cand_id            1001731 non-null  object 
 2   cand_nm            1001731 non-null  object 
 3   contbr_nm          1001731 non-null  object 
 4   contbr_city        1001712 non-null  object 
 5   contbr_st          1001727 non-null  object 
 6   contbr_zip         1001620 non-null  object 
 7   contbr_employer    988002 non-null   object 
 8   contbr_occupation  993301 non-null   object 
 9   contb_receipt_amt  1001731 non-null  float64
 10  contb_receipt_dt   1001731 non-null  object 
 11  receipt_desc       14166 non-null    object 
 12  memo_cd            92482 non-null    object 
 13  memo_text          97770 non-null    object 
 14  form_tp            1001731 non-null  object 
 15  file_num           1001731 non-null  int64  
dtypes: float64(1), int64(1), object(14)
memory usage: 122.3+ MB
fec.iloc[123456]
cmte_id                             C00431445
cand_id                             P80003338
cand_nm                         Obama, Barack
contbr_nm                         ELLMAN, IRA
contbr_city                             TEMPE
contbr_st                                  AZ
contbr_zip                          852816719
contbr_employer      ARIZONA STATE UNIVERSITY
contbr_occupation                   PROFESSOR
contb_receipt_amt                        50.0
contb_receipt_dt                    01-DEC-11
receipt_desc                              NaN
memo_cd                                   NaN
memo_text                                 NaN
form_tp                                 SA17A
file_num                               772372
Name: 123456, dtype: object
# unique returns the list of all distinct political candidates
unique_cands = fec.cand_nm.unique()
unique_cands
array(['Bachmann, Michelle', 'Romney, Mitt', 'Obama, Barack',
       "Roemer, Charles E. 'Buddy' III", 'Pawlenty, Timothy',
       'Johnson, Gary Earl', 'Paul, Ron', 'Santorum, Rick',
       'Cain, Herman', 'Gingrich, Newt', 'McCotter, Thaddeus G',
       'Huntsman, Jon', 'Perry, Rick'], dtype=object)
unique_cands[2]
'Obama, Barack'
parties = {'Bachmann, Michelle':'Republican',
            'Romney, Mitt':'Republican',
            'Obama, Barack':'Democrat',
            "Roemer, Charles E. 'Buddy' III":'Republican',
            'Pawlenty, Timothy':'Republican',
            'Johnson, Gary Earl':'Republican',
            'Paul, Ron':'Republican',
            'Santorum, Rick':'Republican',
            'Cain, Herman':'Republican',
            'Gingrich, Newt':'Republican',
            'McCotter, Thaddeus G':'Republican',
            'Huntsman, Jon':'Republican',
            'Perry, Rick':'Republican'}

fec.cand_nm[123456:123461]
123456    Obama, Barack
123457    Obama, Barack
123458    Obama, Barack
123459    Obama, Barack
123460    Obama, Barack
Name: cand_nm, dtype: object
fec.cand_nm[123456:123461].map(parties)
123456    Democrat
123457    Democrat
123458    Democrat
123459    Democrat
123460    Democrat
Name: cand_nm, dtype: object
fec['party'] = fec.cand_nm.map(parties)
fec['party'].value_counts()
Democrat      593746
Republican    407985
Name: party, dtype: int64
# First, note that the data includes both contributions and refunds (negative contribution amounts)
(fec.contb_receipt_amt>0).value_counts()
True     991475
False     10256
Name: contb_receipt_amt, dtype: int64
fec = fec[fec.contb_receipt_amt>0]
fec_mrbo = fec[fec.cand_nm.isin(['Obama, Barack','Romney, Mitt'])] 

14.5.1 Donation Statistics by Occupation and Employer

fec.contbr_occupation.value_counts()[:10]
RETIRED                                   233990
INFORMATION REQUESTED                      35107
ATTORNEY                                   34286
HOMEMAKER                                  29931
PHYSICIAN                                  23432
INFORMATION REQUESTED PER BEST EFFORTS     21138
ENGINEER                                   14334
TEACHER                                    13990
CONSULTANT                                 13273
PROFESSOR                                  12555
Name: contbr_occupation, dtype: int64
occ_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
              'INFORMATION REQUESTED':'NOT PROVIDED',
              'INFORMATION REQUESTED(BEST EFFORTS)':'NOT PROVIDED',
              'C.E.O':'CEO'}

# Return x itself if it has no entry in the mapping
f = lambda x :occ_mapping.get(x,x)
fec.contbr_occupation = fec.contbr_occupation.map(f)
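The `dict.get(x, x)` trick matters because mapping a Series directly with a dict turns every unmapped value into NaN. A minimal sketch on a made-up Series (the values and the names `occupations` and `mapping` are illustrative assumptions) contrasts the two behaviors:

```python
import pandas as pd

# Made-up occupation values; only some of them have a mapping entry
occupations = pd.Series(["INFORMATION REQUESTED", "ENGINEER", "C.E.O."])
mapping = {"INFORMATION REQUESTED": "NOT PROVIDED", "C.E.O.": "CEO"}

# Passing the dict directly: values without an entry become NaN
direct = occupations.map(mapping)

# Falling back to the original value keeps unmapped entries intact
cleaned = occupations.map(lambda x: mapping.get(x, x))

print(direct)
print(cleaned)
```

Note that the lookup is an exact string match, so a key only takes effect if it matches the raw value character for character, punctuation included.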
emp_mapping = {'INFORMATION REQUESTED PER BEST EFFORTS':'NOT PROVIDED',
               'INFORMATION REQUESTED':'NOT PROVIDED',
               'SELF':'SELF-EMPLOYED',
               'SELF EMPLOYED':'SELF-EMPLOYED'
               }

# Return x itself if it has no entry in the mapping
f = lambda x: emp_mapping.get(x,x)
fec.contbr_employer = fec.contbr_employer.map(f)
by_occupation = fec.pivot_table('contb_receipt_amt',index='contbr_occupation',
                                columns='party',aggfunc='sum')
over_2mm = by_occupation[by_occupation.sum(1)>2000000]
over_2mm
party                 Democrat    Republican
contbr_occupation
ATTORNEY           11141982.97  7.477194e+06
C.E.O.                 1690.00  2.592983e+06
CEO                 2074284.79  1.640758e+06
CONSULTANT          2459912.71  2.544725e+06
ENGINEER             951525.55  1.818374e+06
EXECUTIVE           1355161.05  4.138850e+06
HOMEMAKER           4248875.80  1.363428e+07
INVESTOR             884133.00  2.431769e+06
LAWYER              3160478.87  3.912243e+05
MANAGER              762883.22  1.444532e+06
NOT PROVIDED        4866973.96  2.023715e+07
OWNER               1001567.36  2.408287e+06
PHYSICIAN           3735124.94  3.594320e+06
PRESIDENT           1878509.95  4.720924e+06
PROFESSOR           2165071.08  2.967027e+05
REAL ESTATE          528902.09  1.625902e+06
RETIRED            25305116.38  2.356124e+07
SELF-EMPLOYED        672393.40  1.640253e+06
over_2mm.plot(kind='barh')
def get_top_amounts(group,key,n=5):
    totals = group.groupby(key)['contb_receipt_amt'].sum()
    return totals.nlargest(n)
grouped = fec_mrbo.groupby('cand_nm')
grouped.apply(get_top_amounts,'contbr_occupation',n=7)
cand_nm        contbr_occupation                     
Obama, Barack  RETIRED                                   25305116.38
               ATTORNEY                                  11141982.97
               INFORMATION REQUESTED                      4866973.96
               HOMEMAKER                                  4248875.80
               PHYSICIAN                                  3735124.94
               LAWYER                                     3160478.87
               CONSULTANT                                 2459912.71
Romney, Mitt   RETIRED                                   11508473.59
               INFORMATION REQUESTED PER BEST EFFORTS    11396894.84
               HOMEMAKER                                  8147446.22
               ATTORNEY                                   5364718.82
               PRESIDENT                                  2491244.89
               EXECUTIVE                                  2300947.03
               C.E.O.                                     1968386.11
Name: contb_receipt_amt, dtype: float64

14.5.2 Bucketing Donation Amounts

bins = np.array([0,1,10,100,1000,10000,100000,1000000,10000000])
labels = pd.cut(fec_mrbo.contb_receipt_amt,bins)
labels
411         (10, 100]
412       (100, 1000]
413       (100, 1000]
414         (10, 100]
415         (10, 100]
             ...     
701381      (10, 100]
701382    (100, 1000]
701383        (1, 10]
701384      (10, 100]
701385    (100, 1000]
Name: contb_receipt_amt, Length: 694282, dtype: category
Categories (8, interval[int64]): [(0, 1] < (1, 10] < (10, 100] < (100, 1000] < (1000, 10000] < (10000, 100000] < (100000, 1000000] < (1000000, 10000000]]
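As a small illustration of how `pd.cut` assigns values to buckets, the sketch below uses a handful of made-up amounts (the `amounts` Series is an assumption, not FEC data). Each interval is open on the left and closed on the right, so a value equal to a bin edge falls into the lower bucket:

```python
import numpy as np
import pandas as pd

# Made-up contribution amounts
amounts = pd.Series([0.5, 5, 50, 75, 500, 2500])
bins = np.array([0, 1, 10, 100, 1000, 10000])

# Each amount is labeled with the interval (left, right] that contains it
labels = pd.cut(amounts, bins)

# Count how many amounts fall into each bucket, keeping empty buckets too
counts = amounts.groupby(labels, observed=False).size()
print(counts)
```

Grouping by the resulting categorical is the same pattern used with `fec_mrbo`, just without the second grouping key.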
grouped = fec_mrbo.groupby(['cand_nm',labels])
grouped.size().unstack(0)
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt
(0, 1]                         493            77
(1, 10]                      40070          3681
(10, 100]                   372280         31853
(100, 1000]                 153991         43357
(1000, 10000]                22284         26186
(10000, 100000]                  2             1
(100000, 1000000]                3             0
(1000000, 10000000]              4             0
bucket_sum = grouped.contb_receipt_amt.sum().unstack(0)
normed_sums = bucket_sum.div(bucket_sum.sum(axis=1),axis=0)
normed_sums
cand_nm              Obama, Barack  Romney, Mitt
contb_receipt_amt
(0, 1]                    0.805182      0.194818
(1, 10]                   0.918767      0.081233
(10, 100]                 0.910769      0.089231
(100, 1000]               0.710176      0.289824
(1000, 10000]             0.447326      0.552674
(10000, 100000]           0.823120      0.176880
(100000, 1000000]         1.000000      0.000000
(1000000, 10000000]       1.000000      0.000000
normed_sums[:-2].plot(kind='barh')

14.5.3 Donation Statistics by State

# Aggregating the data by candidate and state is a routine analysis
grouped = fec_mrbo.groupby(['cand_nm','contbr_st'])
totals = grouped.contb_receipt_amt.sum().unstack(0).fillna(0)
totals[totals.sum(1)>100000]
totals[:10]
cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AA              56405.00        135.00
AB               2048.00          0.00
AE              42973.75       5680.00
AK             281840.15      86204.24
AL             543123.48     527303.51
AP              37130.50       1655.00
AR             359247.28     105556.00
AS               2955.00          0.00
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
percent = totals.div(totals.sum(1),axis=0)
percent[:10]
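cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AA              0.997612      0.002388
AB              1.000000      0.000000
AE              0.883257      0.116743
AK              0.765778      0.234222
AL              0.507390      0.492610
AP              0.957329      0.042671
AR              0.772902      0.227098
AS              1.000000      0.000000
AZ              0.443745      0.556255
CA              0.679498      0.320502
# Appendix A: Advanced NumPy

A.1 ndarray Object Internals

  • An ndarray internally consists of the following:
  • A pointer to data: a block of data in RAM or in a memory-mapped file
  • The data type, or dtype, describing the fixed-size value cells in the array
  • A tuple indicating the array's shape
  • A tuple of strides: integers giving the number of bytes to "step" in order to advance one element along a dimension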
np.ones((10,5)).shape
(10, 5)
# A typical (C-order) 3 × 4 × 5 array of float64 (8-byte) values has strides (160, 40, 8)
np.ones((3,4,5),dtype=np.float64).strides
(160, 40, 8)
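To make the strides tuple less mysterious, the following sketch recomputes it from the shape and the item size for a C-order array. The helper `c_order_strides` is a hypothetical name introduced here for illustration; it is not part of NumPy.

```python
import numpy as np

def c_order_strides(shape, itemsize):
    """Strides (in bytes) of a C-contiguous array with the given shape."""
    strides = []
    step = itemsize
    # The last axis moves one item at a time; each earlier axis must skip
    # over everything contained in the axes after it.
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

arr = np.ones((3, 4, 5), dtype=np.float64)
print(c_order_strides(arr.shape, arr.itemsize))
print(arr.strides)
```

For the 3 × 4 × 5 float64 case this gives 8 bytes along the last axis, 5 × 8 = 40 along the middle axis, and 4 × 40 = 160 along the first.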

A.1.1 NumPy dtype Hierarchy

ints = np.ones(10,dtype=np.uint16)
ints
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=uint16)
floats = np.ones(10,dtype=np.float32)
floats
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)
np.issubdtype(ints.dtype,np.integer)
True
np.issubdtype(floats.dtype,np.floating)
True
# You can see all of the parent classes of a specific dtype by calling the type's mro method
np.float64.mro()
[numpy.float64,
 numpy.floating,
 numpy.inexact,
 numpy.number,
 numpy.generic,
 float,
 object]
np.issubdtype(ints.dtype,np.number)
True

A.2 Advanced Array Manipulation

A.2.1 Reshaping Arrays

arr = np.arange(8)
arr
array([0, 1, 2, 3, 4, 5, 6, 7])
# In many cases, you can convert an array from one shape to another without copying any data
arr.reshape(4,2)
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7]])
# A multidimensional array can also be reshaped
arr.reshape(4,2).reshape(2,4)
array([[0, 1, 2, 3],
       [4, 5, 6, 7]])
# One of the passed shape dimensions can be -1, in which case its value is inferred from the data
arr = np.arange(15)
arr.reshape((5,-1))
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
# Since an array's shape attribute is a tuple, it can be passed to reshape, too
other_arr = np.ones((3,5))
other_arr.shape
(3, 5)
arr.reshape(other_arr.shape)
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
# The opposite operation of reshape, from a higher-dimensional array to a one-dimensional one,
# is typically known as flattening or raveling
arr = np.arange(15).reshape((5,3))
arr
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
arr.ravel()
# ravel does not produce a copy of the underlying values if the values in the result were contiguous in the original array
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
arr.flatten()
# flatten behaves like ravel, except that it always returns a copy of the data
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
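The view-versus-copy difference between `ravel` and `flatten` can be checked directly. This sketch uses `np.shares_memory` to test whether two arrays overlap in memory; the variable names are illustrative:

```python
import numpy as np

arr = np.arange(15).reshape((5, 3))

raveled = arr.ravel()      # contiguous input: a view is possible
flattened = arr.flatten()  # always a fresh copy

print(np.shares_memory(arr, raveled))
print(np.shares_memory(arr, flattened))

# Writing through the view changes the original array
raveled[0] = 99
print(arr[0, 0])

# The transpose is not C-contiguous, so ravel has to copy here
t_raveled = arr.T.ravel()
print(np.shares_memory(arr, t_raveled))
```

When in doubt about aliasing, `flatten` (or an explicit `.copy()`) is the safer choice, at the cost of extra memory.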
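
A.2.2 C Versus Fortran Order

  • C order / row-major order
  • Traverse higher dimensions first (e.g., advance along axis 1 before advancing along axis 0)
  • Fortran order / column-major order
  • Traverse higher dimensions last (e.g., advance along axis 0 before advancing along axis 1)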
arr = np.arange(12).reshape((3,4))
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
arr.ravel()
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
arr.ravel('F')
array([ 0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11])
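One way to build intuition for the two orders: raveling in Fortran order gives the same sequence as raveling the transpose in C order, and the `order` argument applies to `reshape` as well. A brief sketch (variable names are illustrative):

```python
import numpy as np

arr = np.arange(12).reshape((3, 4))

c_flat = arr.ravel(order="C")  # walk along rows first
f_flat = arr.ravel(order="F")  # walk down columns first

# Column-major traversal of arr equals row-major traversal of arr.T
print(np.array_equal(f_flat, arr.T.ravel()))

# reshape also accepts order: values fill the new shape column by column
filled = np.arange(6).reshape((2, 3), order="F")
print(filled)
```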

A.2.3 Concatenating and Splitting Arrays

  • numpy.concatenate takes a sequence (tuple, list, etc.) of arrays and joins them in order along the input axis
arr1 = np.array([[1,2,3],[4,5,6]])
arr1
array([[1, 2, 3],
       [4, 5, 6]])
arr2 = np.array([[7,8,9],[10,11,12]])
arr2
array([[ 7,  8,  9],
       [10, 11, 12]])
np.concatenate([arr1,arr2],axis=0)
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
np.concatenate([arr1,arr2],axis=1)
array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])
np.vstack((arr1,arr2))
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
np.hstack((arr1,arr2))
array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])
arr = np.random.randn(5,2)
arr
array([[-0.37933271, -1.04852791],
       [-0.3278915 ,  1.11594819],
       [ 0.77077511, -1.19903381],
       [ 0.38477425, -0.35244269],
       [ 1.38135852, -0.10439573]])
# split slices an array into multiple arrays along an axis
# The value [2,3] passed to np.split gives the indices at which to split the array
first,second,third = np.split(arr,[2,3])
first
array([[-0.37933271, -1.04852791],
       [-0.3278915 ,  1.11594819]])
second
array([[ 0.77077511, -1.19903381]])
third
array([[ 0.38477425, -0.35244269],
       [ 1.38135852, -0.10439573]])
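Because the random values make the split points hard to follow, here is the same split on a deterministic array (an illustrative stand-in for `arr`). The indices `[2, 3]` cut the rows into `[:2]`, `[2:3]`, and `[3:]`, and concatenating the pieces restores the original:

```python
import numpy as np

arr = np.arange(10).reshape((5, 2))

# Split along axis 0 (rows) before row 2 and before row 3
first, second, third = np.split(arr, [2, 3])
print(first.shape, second.shape, third.shape)

# split and concatenate are inverse operations
restored = np.concatenate([first, second, third], axis=0)
print(np.array_equal(restored, arr))

# The same idea along axis 1 separates the columns
left, right = np.split(arr, [1], axis=1)
print(left.shape, right.shape)
```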
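  • Array concatenation functions
Function            Description
concatenate         Most general function; concatenates a collection of arrays along one axis
vstack, row_stack   Stack arrays row-wise (along axis 0)
hstack              Stack arrays column-wise (along axis 1)
column_stack        Like hstack, but converts 1D arrays to 2D column vectors first
dstack              Stack arrays "depth"-wise (along axis 2)
split               Split an array at the passed locations along a particular axis
hsplit/vsplit       Convenience functions for splitting along axis 1 and axis 0, respectively
A.2.3.1 Stacking helpers: r_ and c_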
arr = np.arange(6)
arr
array([0, 1, 2, 3, 4, 5])
arr1= arr.reshape((3,2))
arr1
array([[0, 1],
       [2, 3],
       [4, 5]])
arr2 = np.random.randn(3,2)
arr2
array([[-2.17693174,  1.20516725],
       [-0.44083574,  0.84645799],
       [ 0.02369097,  0.63556261]])
np.r_[arr1,arr2]
array([[ 0.        ,  1.        ],
       [ 2.        ,  3.        ],
       [ 4.        ,  5.        ],
       [-2.17693174,  1.20516725],
       [-0.44083574,  0.84645799],
       [ 0.02369097,  0.63556261]])
np.c_[arr1,arr2]
array([[ 0.        ,  1.        , -2.17693174,  1.20516725],
       [ 2.        ,  3.        , -0.44083574,  0.84645799],
       [ 4.        ,  5.        ,  0.02369097,  0.63556261]])
np.c_[np.r_[arr1,arr2],arr]
array([[ 0.        ,  1.        ,  0.        ],
       [ 2.        ,  3.        ,  1.        ],
       [ 4.        ,  5.        ,  2.        ],
       [-2.17693174,  1.20516725,  3.        ],
       [-0.44083574,  0.84645799,  4.        ],
       [ 0.02369097,  0.63556261,  5.        ]])
# r_ and c_ can also translate slices into arrays
np.c_[1:6,-10:-5]
array([[  1, -10],
       [  2,  -9],
       [  3,  -8],
       [  4,  -7],
       [  5,  -6]])

A.2.4 Repeating Elements: tile and repeat

  • repeat and tile are two useful tools for repeating or replicating arrays.
  • repeat replicates each element in an array some number of times, producing a larger array
arr = np.arange(3)
arr
array([0, 1, 2])
arr.repeat(3)
array([0, 0, 0, 1, 1, 1, 2, 2, 2])
arr.repeat([2,3,4])
array([0, 0, 1, 1, 1, 2, 2, 2, 2])
# Multidimensional arrays can have their elements repeated along a particular axis
arr = np.random.randn(2,2)
arr
array([[-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328]])
arr.repeat(2,axis=0)
array([[-0.86642515, -0.21137086],
       [-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328],
       [ 0.4945539 , -0.02745328]])
# Note that if no axis is passed, the array is flattened first, which is likely not what you want
arr.repeat(2)
array([-0.86642515, -0.86642515, -0.21137086, -0.21137086,  0.4945539 ,
        0.4945539 , -0.02745328, -0.02745328])
# To repeat the slices of a multidimensional array a different number of times, you can pass an array of integers
arr.repeat([2,3],axis=0)
array([[-0.86642515, -0.21137086],
       [-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328],
       [ 0.4945539 , -0.02745328],
       [ 0.4945539 , -0.02745328]])
arr.repeat([2,3],axis=1)
array([[-0.86642515, -0.86642515, -0.21137086, -0.21137086, -0.21137086],
       [ 0.4945539 ,  0.4945539 , -0.02745328, -0.02745328, -0.02745328]])
# tile is a shortcut for stacking copies of an array along an axis. Visually, you can think of it as akin to "laying down tiles"
arr
array([[-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328]])
np.tile(arr,2)
array([[-0.86642515, -0.21137086, -0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328,  0.4945539 , -0.02745328]])
np.tile(arr,(2,1))
array([[-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328],
       [-0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328]])
# The second argument to tile can be a tuple indicating the layout of the "tiling"
np.tile(arr,(2,2))
array([[-0.86642515, -0.21137086, -0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328,  0.4945539 , -0.02745328],
       [-0.86642515, -0.21137086, -0.86642515, -0.21137086],
       [ 0.4945539 , -0.02745328,  0.4945539 , -0.02745328]])
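The difference between the two functions is easiest to see side by side on a small integer array (an illustrative example, not taken from the random arrays above): `repeat` duplicates each element in place, while `tile` repeats the array as a whole.

```python
import numpy as np

arr = np.array([1, 2, 3])

repeated = arr.repeat(2)  # each element twice, neighbors stay together
tiled = np.tile(arr, 2)   # the whole array twice, end to end

print(repeated)
print(tiled)

# With a tuple, tile lays the block out on a grid: 2 copies down, 1 across
grid = np.tile(arr, (2, 1))
print(grid.shape)
```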

A.2.5 Fancy Indexing Equivalents: take and put

arr = np.arange(10)*100
arr
array([  0, 100, 200, 300, 400, 500, 600, 700, 800, 900])
inds = [7,1,2,6]
arr[inds]
array([700, 100, 200, 600])
arr.take(inds)
array([700, 100, 200, 600])
arr.put(inds,42)
arr
array([  0,  42,  42, 300, 400, 500,  42,  42, 800, 900])
arr.put(inds,[40,41,42,43])
arr
array([  0,  41,  42, 300, 400, 500,  43,  40, 800, 900])
# To use take along other axes, you can pass the axis keyword
inds = [2,0,2,1]
arr = np.random.randn(2,4)
arr
array([[ 0.42067458,  1.11465134,  0.80097006, -0.37064359],
       [-0.57974434,  1.24554556,  0.25903436, -0.10895085]])
arr.take(inds,axis=1)
array([[ 0.80097006,  0.42067458,  0.80097006,  1.11465134],
       [ 0.25903436, -0.57974434,  0.25903436,  1.24554556]])
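
A.3 Broadcasting

  • Broadcasting describes how arithmetic works between arrays of different shapes.
  • The broadcasting rule
  • Two arrays are compatible for broadcasting if, for each trailing dimension (i.e., starting from the end), the axis lengths match or one of the lengths is 1. Broadcasting is then performed over the missing or length-1 dimensions.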
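The rule can be written down as a few lines of plain Python. The helper `broadcast_compatible` below is a hypothetical function introduced only to illustrate the rule; NumPy's own `np.broadcast_shapes` (available in NumPy 1.20 and later) is used for comparison.

```python
import numpy as np

def broadcast_compatible(shape_a, shape_b):
    """Return True if two shapes satisfy the trailing-dimension rule."""
    # Walk both shapes from the last axis backwards; zip stops at the shorter
    # shape, so missing leading axes are always acceptable.
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(broadcast_compatible((4, 3), (3,)))    # column means against a 4x3 array
print(broadcast_compatible((4, 3), (4,)))    # row means without reshaping
print(broadcast_compatible((4, 3), (4, 1)))  # row means reshaped to a column

# NumPy reports the resulting shape for compatible inputs
print(np.broadcast_shapes((4, 3), (4, 1)))
```

The second case is the one that fails in practice: a length-4 vector lines up with the trailing axis of length 3, so it has to be reshaped to `(4, 1)` first.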
arr = np.arange(5)
arr
array([0, 1, 2, 3, 4])
# Here we say that the scalar value 4 has been broadcast to all of the other elements in the multiplication operation
arr*4
array([ 0,  4,  8, 12, 16])
arr = np.random.randn(4,3)
arr
array([[-0.26130828,  0.21031853,  0.09806178],
       [-1.89409267, -0.30607457,  1.14174612],
       [-0.04140891, -1.4256403 ,  0.17503634],
       [ 0.94815936, -0.47780023, -0.17362592]])
arr.mean(0)
array([-0.31216263, -0.49979914,  0.31030458])
demeaned = arr - arr.mean(0)
demeaned
array([[ 0.05085435,  0.71011767, -0.2122428 ],
       [-1.58193004,  0.19372457,  0.83144154],
       [ 0.27075371, -0.92584116, -0.13526824],
       [ 1.26032198,  0.02199892, -0.4839305 ]])
demeaned.mean(0)
array([5.55111512e-17, 1.38777878e-17, 1.38777878e-17])
arr
array([[-0.26130828,  0.21031853,  0.09806178],
       [-1.89409267, -0.30607457,  1.14174612],
       [-0.04140891, -1.4256403 ,  0.17503634],
       [ 0.94815936, -0.47780023, -0.17362592]])
row_means = arr.mean(1)
row_means
array([ 0.01569068, -0.35280704, -0.43067096,  0.09891107])
row_means.shape
(4,)
row_means.reshape((4,1))
array([[ 0.01569068],
       [-0.35280704],
       [-0.43067096],
       [ 0.09891107]])
demeaned = arr - row_means.reshape((4,1))
demeaned.mean(1)
array([4.62592927e-18, 7.40148683e-17, 7.40148683e-17, 0.00000000e+00])

A.3.1 Broadcasting over Other Axes

arr - arr.mean(1)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-111-8b8ada26fac0> in <module>
----> 1 arr - arr.mean(1)


ValueError: operands could not be broadcast together with shapes (4,3) (4,) 
arr - arr.mean(1).reshape((4,1))
array([[-0.27699896,  0.19462785,  0.0823711 ],
       [-1.54128563,  0.04673247,  1.49455316],
       [ 0.38926205, -0.99496934,  0.6057073 ],
       [ 0.84924828, -0.5767113 , -0.27253699]])
# Use the special np.newaxis attribute together with "full" slices to insert a new axis
arr = np.zeros((4,4))
arr
array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])
arr_3d = arr[:,np.newaxis,:]
arr_3d
array([[[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]]])
arr_3d.shape
(4, 1, 4)
arr_1d = np.random.normal(size=3)
arr_1d[:,np.newaxis]
array([[-0.44142019],
       [ 0.19138049],
       [ 1.70465573]])
arr_1d[np.newaxis,:]
array([[-0.44142019,  0.19138049,  1.70465573]])
arr = np.random.randn(3,4,5)
arr
array([[[ 0.10223077, -1.53873895, -0.99946213,  0.71598751,
         -0.90498114],
        [-0.01548156,  0.30273138,  0.34831772,  1.64086735,
          0.52801345],
        [-1.31620627, -0.79570758, -1.34854625, -2.63311809,
         -1.11911915],
        [-0.80136175, -1.94967438, -0.28787123,  0.33664872,
          0.16180744]],

       [[ 1.77507844, -0.6858868 , -0.53739313,  1.33779554,
          1.53855697],
        [ 1.9271013 ,  0.58314326, -0.73893003,  0.67052899,
         -0.00530868],
        [-0.19838128, -0.92396483, -0.72747217,  0.8346707 ,
          0.44643892],
        [-0.37615445,  1.8688799 , -0.55484319,  0.50585597,
         -0.26799842]],

       [[ 0.57238033, -0.17529308, -0.72637569, -2.89489543,
         -0.01108801],
        [-0.17406094, -0.79553743, -0.64445857, -1.0084828 ,
          0.59183829],
        [-0.60375821,  0.15761849,  0.25371104, -0.60639911,
         -1.20483347],
        [ 0.70185761, -0.90187431,  0.45284624, -1.09157387,
          0.70808834]]])
depth_means = arr.mean(2)
depth_means
array([[-0.52499279,  0.56088967, -1.44253947, -0.50809024],
       [ 0.6856302 ,  0.48730697, -0.11374173,  0.23514796],
       [-0.64705438, -0.40614029, -0.40073225, -0.0261312 ]])
depth_means.shape
(3, 4)
demeaned = arr - depth_means[:,:,np.newaxis]
demeaned.mean(2)
array([[-2.22044605e-17,  4.44089210e-17, -2.22044605e-17,
        -2.22044605e-17],
       [-4.44089210e-17, -2.22044605e-17,  4.44089210e-17,
         0.00000000e+00],
       [-8.88178420e-17, -4.44089210e-17,  8.88178420e-17,
         2.22044605e-17]])
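def demean_axis(arr,axis=0):
    # Subtract the mean along the given axis, for an array of any dimensionality
    means = arr.mean(axis)
    # Build an indexer such as [:, :, np.newaxis] that reinserts the reduced axis
    indexer = [slice(None)]*arr.ndim
    indexer[axis] = np.newaxis
    # Index with a tuple; a list here would be interpreted as fancy indexing
    return arr - means[tuple(indexer)]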
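A quick way to exercise this kind of helper is to demean along every axis and check that the resulting means vanish. The sketch below defines its own copy of the function with the indexer converted to a tuple (recent NumPy versions require a tuple for this kind of multidimensional indexing) and compares it with the `keepdims=True` shortcut, which reaches the same result without building an indexer. The random test array is an assumption for illustration.

```python
import numpy as np

def demean_axis(arr, axis=0):
    means = arr.mean(axis)
    # e.g. for axis=1 on a 3D array: (slice(None), np.newaxis, slice(None))
    indexer = [slice(None)] * arr.ndim
    indexer[axis] = np.newaxis
    return arr - means[tuple(indexer)]

rng = np.random.default_rng(0)
arr = rng.standard_normal((3, 4, 5))

for axis in range(arr.ndim):
    demeaned = demean_axis(arr, axis)
    # keepdims=True leaves a length-1 axis in place, so broadcasting just works
    shortcut = arr - arr.mean(axis, keepdims=True)
    print(axis, np.allclose(demeaned.mean(axis), 0), np.allclose(demeaned, shortcut))
```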

A.3.2 Setting Array Values by Broadcasting

arr = np.zeros((4,3))
arr[:] = 5
arr
array([[5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.],
       [5., 5., 5.]])
col = np.array([1.28,-0.42,0.44,1.6])
arr[:] = col[:,np.newaxis]
arr
array([[ 1.28,  1.28,  1.28],
       [-0.42, -0.42, -0.42],
       [ 0.44,  0.44,  0.44],
       [ 1.6 ,  1.6 ,  1.6 ]])
arr[:2] = [[-1.37],[0.509]]
arr
array([[-1.37 , -1.37 , -1.37 ],
       [ 0.509,  0.509,  0.509],
       [ 0.44 ,  0.44 ,  0.44 ],
       [ 1.6  ,  1.6  ,  1.6  ]])
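
A.4 Advanced ufunc Usage

A.4.1 ufunc Instance Methods

  • Each of NumPy's binary ufuncs (universal functions) has special methods for performing certain kinds of vectorized operations.
  • ufunc methods
Method              Description
reduce(x)           Aggregate values by successive applications of the operation
accumulate(x)       Aggregate values, preserving all partial aggregates
reduceat(x, bins)   "Local" reduce or "group by"; reduce contiguous slices of data to produce an aggregated array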
arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
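# reduce takes a single array and aggregates its values, optionally along an axis, by performing a sequence of binary operations.
# The starting value (0 for add) depends on the ufunc. If an axis is passed, the reduction is performed along that axis.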
np.add.reduce(arr)
45
arr.sum()
45
# Use np.logical_and to check whether the values in each row of an array are sorted
np.random.seed(12346)
arr = np.random.randn(5,5)
arr
array([[-8.99822478e-02,  7.59372617e-01,  7.48336101e-01,
        -9.81497953e-01,  3.65775545e-01],
       [-3.15442628e-01, -8.66135605e-01,  2.78568155e-02,
        -4.55597723e-01, -1.60189223e+00],
       [ 2.48256116e-01, -3.21536673e-01, -8.48730755e-01,
         4.60468309e-04, -5.46459347e-01],
       [ 2.53915229e-01,  1.93684246e+00, -7.99504902e-01,
        -5.69159281e-01,  4.89244731e-02],
       [-6.49092950e-01, -4.79535727e-01, -9.53521432e-01,
         1.42253882e+00,  1.75403128e-01]])
# Sort every other row (rows 0, 2, and 4) in place
arr[::2].sort(1)
arr[:,:-1]
array([[-9.81497953e-01, -8.99822478e-02,  3.65775545e-01,
         7.48336101e-01],
       [-3.15442628e-01, -8.66135605e-01,  2.78568155e-02,
        -4.55597723e-01],
       [-8.48730755e-01, -5.46459347e-01, -3.21536673e-01,
         4.60468309e-04],
       [ 2.53915229e-01,  1.93684246e+00, -7.99504902e-01,
        -5.69159281e-01],
       [-9.53521432e-01, -6.49092950e-01, -4.79535727e-01,
         1.75403128e-01]])
arr[:,1:]
array([[-8.99822478e-02,  3.65775545e-01,  7.48336101e-01,
         7.59372617e-01],
       [-8.66135605e-01,  2.78568155e-02, -4.55597723e-01,
        -1.60189223e+00],
       [-5.46459347e-01, -3.21536673e-01,  4.60468309e-04,
         2.48256116e-01],
       [ 1.93684246e+00, -7.99504902e-01, -5.69159281e-01,
         4.89244731e-02],
       [-6.49092950e-01, -4.79535727e-01,  1.75403128e-01,
         1.42253882e+00]])
arr[:,:-1] < arr[:,1:]
array([[ True,  True,  True,  True],
       [False,  True, False, False],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True,  True]])
# Note that logical_and.reduce is equivalent to the all method
np.logical_and.reduce(arr[:,:-1] < arr[:,1:],axis=1)
array([ True, False,  True, False,  True])
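The row-sortedness check can be reproduced on a tiny array whose answer is easy to see by eye. This is only a sketch: the `sample` array is made up for illustration, and it simply confirms that `np.logical_and.reduce` and the `all` method give the same answer.

```python
import numpy as np

# Hypothetical sample data: rows 0 and 2 are strictly increasing, row 1 is not.
sample = np.array([[1, 2, 3, 4],
                   [4, 1, 3, 2],
                   [0, 5, 6, 9]])

# Compare every element with its right-hand neighbour.
increasing = sample[:, :-1] < sample[:, 1:]

# Reduce each row of booleans with logical AND ...
via_reduce = np.logical_and.reduce(increasing, axis=1)
# ... which should match the all method along the same axis.
via_all = increasing.all(axis=1)

print(via_reduce)
print(via_all)
```

Both approaches collapse each row of comparisons into a single boolean, so either can be used as a "is this row sorted" test.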
# accumulate is related to reduce in the same way that cumsum is related to sum.
# It produces an array of the same size as the input, holding the intermediate "accumulated" values
arr = np.arange(15).reshape((3,5))
arr
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
np.add.reduce(arr,axis=1)
array([10, 35, 60])
np.add.accumulate(arr,axis=1)
array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]], dtype=int32)
np.add.reduce(arr,axis=0)
array([15, 18, 21, 24, 27])
np.add.accumulate(arr,axis=0)
array([[ 0,  1,  2,  3,  4],
       [ 5,  7,  9, 11, 13],
       [15, 18, 21, 24, 27]], dtype=int32)
# outer performs a pairwise cross product between two arrays
arr = np.arange(3).repeat([1,2,2])
arr
array([0, 1, 1, 2, 2])
np.multiply.outer(arr,np.arange(5))
array([[0, 0, 0, 0, 0],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 2, 4, 6, 8],
       [0, 2, 4, 6, 8]])
# The number of dimensions of the output of outer is the sum of the numbers of dimensions of the inputs
x,y = np.random.randn(3,4),np.random.randn(5)
x
array([[-1.1049211 ,  0.7239073 , -0.95465401,  0.24438966],
       [-0.14528732, -0.12229477,  0.49165039, -1.55720967],
       [ 0.11172771, -0.26132992,  0.27843076, -0.10798888]])
y
array([ 0.11090105, -0.37904993,  2.60555583, -1.02235214,  0.26172618])
result = np.subtract.outer(x,y)
result.shape
(3, 4, 5)
# reduceat accepts a sequence of "bin edges" that indicate how to split and aggregate the values
# Here the reduction (a sum) is performed over arr[0:5], arr[5:8] and arr[8:]
arr = np.arange(10)
print(arr)
np.add.reduceat(arr,[0,5,8])
[0 1 2 3 4 5 6 7 8 9]
array([10, 18, 17], dtype=int32)
# You can also pass an axis argument
arr = np.multiply.outer(np.arange(4),np.arange(5))
arr
array([[ 0,  0,  0,  0,  0],
       [ 0,  1,  2,  3,  4],
       [ 0,  2,  4,  6,  8],
       [ 0,  3,  6,  9, 12]])
np.add.reduceat(arr,[0,2,4],axis=1)
array([[ 0,  0,  0],
       [ 1,  5,  4],
       [ 2, 10,  8],
       [ 3, 15, 12]], dtype=int32)
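To make the "bin edge" behaviour of reduceat concrete, the sketch below compares it with explicit slicing. The names `edges`, `bounds` and `manual` are introduced here for illustration only, and the comparison assumes the edges are strictly increasing, as in the notes above.

```python
import numpy as np

values = np.arange(10)
edges = [0, 5, 8]

# Sum over values[0:5], values[5:8] and values[8:] in one call.
binned = np.add.reduceat(values, edges)

# The same aggregation written out with explicit slices.
bounds = edges + [len(values)]
manual = np.array([values[start:stop].sum()
                   for start, stop in zip(bounds[:-1], bounds[1:])])

print(binned)
print(manual)
```

When the edges are increasing, each output element is just the reduction of one slice, which is why the two results agree.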

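A.4.2 Writing New ufuncs in Python

  • numpy.frompyfunc accepts a Python function together with the number of inputs and outputs it has.
def add_elements(x,y):
    return x+y
# frompyfunc(func, nin, nout, *[, identity])
# The first argument is the function, the second the number of inputs, and the third the number of outputs
add_them = np.frompyfunc(add_elements,2,1)
add_them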
<ufunc 'add_elements (vectorized)'>
add_them(np.arange(8),np.arange(8))
array([0, 2, 4, 6, 8, 10, 12, 14], dtype=object)
# Another (slightly less featureful) function, numpy.vectorize, lets you specify the output type
add_them = np.vectorize(add_elements,otypes=[np.float64])
add_them
<numpy.vectorize at 0x15fa8974a60>
add_them(np.arange(8),np.arange(8))
array([ 0.,  2.,  4.,  6.,  8., 10., 12., 14.])
# Functions built this way are convenient but slow, because every element requires a Python function call
arr = np.random.randn(10000)
%timeit add_them(arr,arr)
902 µs ± 6.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# The built-in ufunc np.add is much faster
%timeit np.add(arr,arr)
2.7 µs ± 2.16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
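The difference in output type between the two wrappers can be checked directly. The following is a small sketch that reuses the `add_elements` function from the notes; the variable names `obj_add`, `float_add` and `converted` are only illustrative.

```python
import numpy as np

def add_elements(x, y):
    return x + y

# frompyfunc always returns arrays of Python objects.
obj_add = np.frompyfunc(add_elements, 2, 1)
# vectorize lets the caller pick the output dtype through otypes.
float_add = np.vectorize(add_elements, otypes=[np.float64])

a = np.arange(8)
obj_result = obj_add(a, a)
float_result = float_add(a, a)

# An object array can be converted afterwards if a numeric dtype is needed.
converted = obj_result.astype(np.float64)

print(obj_result.dtype, float_result.dtype, converted.dtype)
```

Neither wrapper removes the per-element Python call, so they are best kept for convenience rather than speed.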

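A.5 Structured and Record Arrays

  • An ndarray is a container for homogeneous data: it represents a block of memory in which every element takes up the same number of bytes, as determined by the dtype. A structured dtype lets each element hold several named fields.
dtype = [('x',np.float64),('y',np.int32)]
sarr = np.array([(1.5,6),(np.pi,-2)],dtype=dtype)
sarr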
array([(1.5       ,  6), (3.14159265, -2)],
      dtype=[('x', '<f8'), ('y', '<i4')])
# A typical way to specify a structured dtype is as a list of (field_name, field_data_type) tuples
sarr[0]
(1.5, 6)
sarr[0]['y']
6
sarr['x']
array([1.5       , 3.14159265])

A.5.1 Nested dtypes and Multidimensional Fields

  • When specifying a structured dtype, you can additionally pass a shape (as an int or a tuple)
dtype = [('x',np.float64,3),('y',np.int32)]
arr = np.zeros(4,dtype=dtype)
arr
array([([0., 0., 0.], 0), ([0., 0., 0.], 0), ([0., 0., 0.], 0),
       ([0., 0., 0.], 0)], dtype=[('x', '<f8', (3,)), ('y', '<i4')])
# In this case, the x field refers to an array of length 3 for each record
arr[0]['x']
array([0., 0., 0.])
arr['x']
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])
# A field can itself have a structured dtype, which lets you nest dtypes
dtype = [('x',[('a','f8'),('b','f4')]),('y',np.int32)]
data = np.array([((1,2),5),((3,4),6)],dtype=dtype)
data['x']
array([(1., 2.), (3., 4.)], dtype=[('a', '<f8'), ('b', '<f4')])
data['y']
array([5, 6])
data['x']['a']
array([1., 3.])

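A.5.2 Why Use Structured Arrays?

  • Structured arrays provide a way to interpret a block of memory as a tabular structure with arbitrarily complex nested columns.
  • Because each element of the array is represented in memory as a fixed number of bytes, structured arrays are a very fast and efficient way to write data to and read data from disk (including memory maps), to transport it over the network, and for other such uses.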
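The fixed-size record layout can be seen by round-tripping a structured array through raw bytes. This is a minimal sketch, assuming the default packed (unaligned) layout and explicit little-endian field types; the names `records`, `raw` and `restored` are made up for the example.

```python
import numpy as np

# Explicit little-endian types so the byte layout does not depend on the platform.
dtype = np.dtype([('x', '<f8'), ('y', '<i4')])
records = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)

# Each record occupies dtype.itemsize bytes (8 for x plus 4 for y).
raw = records.tobytes()

# The same bytes can be reinterpreted as records without any parsing step.
restored = np.frombuffer(raw, dtype=dtype)

print(dtype.itemsize, len(raw))
print(restored['x'], restored['y'])
```

Because no per-field parsing is involved, the same idea extends to files and sockets that carry fixed-width binary records.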

A.6 More About Sorting

  • The ndarray sort instance method is an in-place sort: the contents of the array are rearranged without producing a new array
arr = np.random.randn(6)
arr
array([ 0.51034093, -1.21799778, -0.27034648, -1.33534252, -0.78528729,
       -1.10908521])
arr.sort()
arr
array([-1.33534252, -1.21799778, -1.10908521, -0.78528729, -0.27034648,
        0.51034093])
# When sorting in place, remember that if the array is a view on a different ndarray, the original array will be modified
arr = np.random.randn(3,5)
arr
array([[-0.00369513, -0.15297778, -0.46090167, -0.42008296, -0.91017112],
       [-1.05144731,  1.41433111,  0.22343751,  1.98200412, -0.11843381],
       [-1.71099598, -0.77901664,  1.9175701 , -0.36801273,  0.35893302]])
# Sort the values of the first column in place
arr[:,0].sort()
# Only the first column of arr has been reordered
arr
array([[-1.71099598, -0.15297778, -0.46090167, -0.42008296, -0.91017112],
       [-1.05144731,  1.41433111,  0.22343751,  1.98200412, -0.11843381],
       [-0.00369513, -0.77901664,  1.9175701 , -0.36801273,  0.35893302]])
# numpy.sort, by contrast, creates a new, sorted copy of an array
arr = np.random.randn(5)
arr
array([ 0.83175214,  0.0981957 , -0.16337765,  1.57507692,  1.20540736])
np.sort(arr)
array([-0.16337765,  0.0981957 ,  0.83175214,  1.20540736,  1.57507692])
# np.sort leaves the original array unchanged
arr
array([ 0.83175214,  0.0981957 , -0.16337765,  1.57507692,  1.20540736])
# All of these sort methods take an axis argument for sorting the sections of data along the passed axis independently
arr = np.random.randn(3,5)
arr
array([[ 0.48623846,  1.40501429,  0.21771959, -0.6147521 , -1.03729051],
       [ 0.00466416,  1.31854631, -0.09256828, -1.03503114,  0.70669487],
       [-0.06967569, -0.55095404,  0.87325007, -1.9579896 , -0.10276109]])
# Sort within each row (axis=1); this modifies arr in place
arr.sort(axis=1)
arr
array([[-1.03729051, -0.6147521 ,  0.21771959,  0.48623846,  1.40501429],
       [-1.03503114, -0.09256828,  0.00466416,  0.70669487,  1.31854631],
       [-1.9579896 , -0.55095404, -0.10276109, -0.06967569,  0.87325007]])
# Sort within each column (axis=0); this also modifies arr in place
arr.sort(axis=0)
arr
array([[-1.9579896 , -0.6147521 , -0.10276109, -0.06967569,  0.87325007],
       [-1.03729051, -0.55095404,  0.00466416,  0.48623846,  1.31854631],
       [-1.03503114, -0.09256828,  0.21771959,  0.70669487,  1.40501429]])
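  • You may notice that none of the sort methods has an option to sort in descending order.
  • In practice this is rarely a problem: array slicing produces views, so reversing a sorted array with [::-1] requires neither a copy nor any extra computation.
arr[:,::-1]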
array([[ 0.87325007, -0.06967569, -0.10276109, -0.6147521 , -1.9579896 ],
       [ 1.31854631,  0.48623846,  0.00466416, -0.55095404, -1.03729051],
       [ 1.40501429,  0.70669487,  0.21771959, -0.09256828, -1.03503114]])
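The three behaviours in this section (in-place sort through a view, np.sort returning a copy, and descending order through a reversed view) can be checked on a small array. This is only a sketch with made-up data; `base`, `ascending` and `descending` are illustrative names.

```python
import numpy as np

base = np.array([[3.0, 1.0, 2.0],
                 [0.0, 7.0, 8.0]])

# base[:, 0] is a view, so sorting it in place reorders the first column of base.
base[:, 0].sort()

# np.sort returns a new sorted array and leaves base untouched.
ascending = np.sort(base, axis=1)

# Reversing with a slice gives descending order without copying any data.
descending = ascending[:, ::-1]

print(base)
print(descending)
print(np.shares_memory(ascending, descending))
```

Since `descending` is a view, writing to it would also change `ascending`; call copy() on it if an independent array is needed.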

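A.6.1 Indirect Sorts: argsort and lexsort

  • An indirect sort does not reorder the data itself; given one or more keys, it returns an array of integer indices that tell you how to reorder the data into sorted order.
  • pandas methods such as the sort_values method of Series and DataFrame are implemented with variants of these functions (which must also take missing values into account)
values = np.array([5,0,1,3,2])
indexer = values.argsort()
indexer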
array([1, 2, 4, 3, 0], dtype=int64)
values[indexer]
array([0, 1, 2, 3, 5])
# Reorder a two-dimensional array by its first row
arr = np.random.randn(3,5)
arr[0] = values
arr
array([[ 5.        ,  0.        ,  1.        ,  3.        ,  2.        ],
       [ 1.01782863, -1.18082614,  0.66861266, -1.51142124, -0.91934196],
       [ 1.16468714,  0.12410901,  1.69151564,  0.8931546 ,  0.16763928]])
arr[:,arr[0].argsort()]
array([[ 0.        ,  1.        ,  2.        ,  3.        ,  5.        ],
       [-1.18082614,  0.66861266, -0.91934196, -1.51142124,  1.01782863],
       [ 0.12410901,  1.69151564,  0.16763928,  0.8931546 ,  1.16468714]])
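# lexsort is similar to argsort, but it performs an indirect lexicographical sort on multiple key arrays
# The keys are applied starting with the last array passed, so here last_name is the primary key
first_name = np.array(['Bob','Jane','Steve','Bill','Barbara'])
last_name = np.array(['Jone','Arnold','Arnold','Jone','Walters'])
sorter = np.lexsort((first_name,last_name))
sorter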
array([1, 2, 3, 0, 4], dtype=int64)
first_name[sorter]
array(['Jane', 'Steve', 'Bill', 'Bob', 'Barbara'], dtype='<U7')
last_name[sorter]
array(['Arnold', 'Arnold', 'Jone', 'Jone', 'Walters'], dtype='<U7')
# zip returns a lazy iterator; wrap it in list(...) to see the (first, last) pairs
zip(first_name[sorter],last_name[sorter])
<zip at 0x15fa89182c0>
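Because the key order of lexsort is easy to get backwards, it may help to compare it with plain Python sorting. The sketch below reuses the name arrays from the notes and assumes nothing beyond NumPy and the built-in `sorted`; `pairs` and `expected` are illustrative names.

```python
import numpy as np

first_name = np.array(['Bob', 'Jane', 'Steve', 'Bill', 'Barbara'])
last_name = np.array(['Jone', 'Arnold', 'Arnold', 'Jone', 'Walters'])

# The LAST key passed is the primary sort key: last name first, then first name.
sorter = np.lexsort((first_name, last_name))
pairs = [(str(f), str(l)) for f, l in zip(first_name[sorter], last_name[sorter])]

# The equivalent ordering with sorted and an explicit (last, first) tuple key.
expected = sorted(zip(first_name.tolist(), last_name.tolist()),
                  key=lambda pair: (pair[1], pair[0]))

print(sorter)
print(pairs)
```

Reading the tuple passed to lexsort from right to left gives the priority of the keys.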

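A.6.2 Alternative Sort Algorithms

  • A stable sorting algorithm preserves the relative position of equal elements, which matters for indirect sorts where the relative ordering is meaningful.
Kind        Speed   Stable   Work space   Worst case
quicksort   1       No       0            O(n^2)
mergesort   2       Yes      n / 2        O(n log n)
heapsort    3       No       0            O(n log n)
values = np.array(['2:first','2:second','1:first','1:second','1:third'])
key = np.array([2,2,1,1,1])
indexer = key.argsort(kind='mergesort')
indexer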
array([2, 3, 4, 0, 1], dtype=int64)
values.take(indexer)
array(['1:first', '1:second', '1:third', '2:first', '2:second'],
      dtype='<U8')

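A.6.3 Partially Sorting Arrays

  • One of the goals of sorting can be to determine the largest or smallest elements in an array.
  • NumPy has fast methods, numpy.partition and np.argpartition, for partitioning an array around the k-th smallest element
np.random.seed(12345)
arr = np.random.randn(20)
arr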
array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057,
        1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474,
        1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684,
        0.88642934, -2.00163731, -0.37184254,  1.66902531, -0.43856974])
# After calling partition(arr, 3), the first three elements in the result are the three smallest values, in no particular order
np.partition(arr,3)
array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.37184254,
       -0.43856974, -0.20470766,  0.28174615,  0.76902257,  0.47894334,
        1.00718936,  0.09290788,  0.27499163,  0.22891288,  1.35291684,
        0.88642934,  1.39340583,  1.96578057,  1.66902531,  1.24643474])
# numpy.argpartition, like numpy.argsort, returns the indices that rearrange the data into the equivalent order
indices = np.argpartition(arr,3)
indices
array([16, 11,  3,  2, 17, 19,  0,  7,  8,  1, 10,  6, 12, 13, 14, 15,  5,
        4, 18,  9], dtype=int64)
arr.take(indices)
array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.37184254,
       -0.43856974, -0.20470766,  0.28174615,  0.76902257,  0.47894334,
        1.00718936,  0.09290788,  0.27499163,  0.22891288,  1.35291684,
        0.88642934,  1.39340583,  1.96578057,  1.66902531,  1.24643474])
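A common use of partial sorting is picking the k smallest or largest values without sorting everything. The sketch below is one way to do that with argpartition; the data, the seed and the names `scores`, `smallest` and `largest` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.standard_normal(20)
k = 3

# Indices of the k smallest values (in no particular order), then sort just those k.
smallest_idx = np.argpartition(scores, k)[:k]
smallest = np.sort(scores[smallest_idx])

# A negative kth partitions around the k-th largest element instead.
largest_idx = np.argpartition(scores, -k)[-k:]
largest = np.sort(scores[largest_idx])[::-1]

print(smallest)
print(largest)
```

Only the k selected values are fully sorted here, which is the point of partitioning when k is much smaller than the array.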

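A.6.4 numpy.searchsorted: Finding Elements in a Sorted Array

  • searchsorted is an array method that performs a binary search on a sorted array, returning the location in the array where the value would need to be inserted to keep it sorted
arr = np.array([0,1,7,12,15])
arr.searchsorted(9)
3
# You can also pass an array of values to get an array of indices back
# For the element 0, searchsorted returns 0, because the default behavior is to return the index at the left side of a group of equal values
arr.searchsorted([0,8,11,16])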
array([0, 3, 3, 5], dtype=int64)
arr = np.array([0,0,0,1,1,1,1])
arr.searchsorted([0,1])
array([0, 3], dtype=int64)
arr.searchsorted([0,1],side='right')
array([3, 7], dtype=int64)
# As another application of searchsorted, suppose we have an array of values between 0 and 10,000
# and a separate array of "bucket edges" that we want to use to bin the data
data = np.floor(np.random.uniform(0,10000,size=50))
data
array([9940., 6768., 7908., 1709.,  268., 8003., 9037.,  246., 4917.,
       5262., 5963.,  519., 8950., 7282., 8183., 5002., 8101.,  959.,
       2189., 2587., 4681., 4593., 7095., 1780., 5314., 1677., 7688.,
       9281., 6094., 1501., 4896., 3773., 8486., 9110., 3838., 3154.,
       5683., 1878., 1258., 6875., 7996., 5735., 9732., 6340., 8884.,
       4954., 3516., 7142., 5039., 2256.])
# searchsorted labels the interval each data point belongs to (1 would mean the bucket from 0 to 100)
bins = np.array([0,100,1000,5000,10000])
labels = bins.searchsorted(data)
labels
array([4, 4, 4, 3, 2, 4, 4, 2, 3, 4, 4, 2, 4, 4, 4, 4, 4, 2, 3, 3, 3, 3,
       4, 3, 4, 3, 4, 4, 4, 3, 3, 3, 4, 4, 3, 3, 4, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 4, 4, 3], dtype=int64)
# These labels can be combined with pandas's groupby to bin the data
pd.Series(data).groupby(labels).count()
2     4
3    18
4    28
dtype: int64
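The effect of the side argument on binning shows up when a value falls exactly on a bucket edge. The following sketch uses a small hand-picked array (not the random data above) and counts bucket members with np.bincount instead of pandas; all names other than `bins` are illustrative.

```python
import numpy as np

bins = np.array([0, 100, 1000, 5000, 10000])
# Hypothetical values; 100.0 sits exactly on a bucket edge.
data = np.array([5.0, 100.0, 250.0, 999.0, 4200.0, 9940.0])

# Default side='left': label i means bins[i-1] < x <= bins[i].
labels_left = bins.searchsorted(data)
# side='right': label i means bins[i-1] <= x < bins[i].
labels_right = bins.searchsorted(data, side='right')

# Count how many values fall into each label.
counts = np.bincount(labels_right, minlength=len(bins))

print(labels_left)
print(labels_right)
print(counts)
```

Only the value on the edge changes label between the two calls, so the choice of side decides which bucket owns its boundary.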

A.7 Writing Fast NumPy Functions with Numba

  • Numba (http://numba.pydata.org) is an open source project that creates fast functions for NumPy-like data using CPUs, GPUs, or other hardware.
  • Numba cannot compile arbitrary Python code, but it supports a significant subset of pure Python that is most useful for writing numerical algorithms
# This function computes the expression (x - y).mean() using a for loop
import numpy as np
def mean_distance(x,y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
    return result/count
x = np.random.randn(1000000)
y = np.random.randn(1000000)
%timeit mean_distance(x,y)
232 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit (x-y).mean()
1.51 ms ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# This function can be compiled into a Numba function with numba.jit
import numba as nb
numba_mean_distance = nb.jit(mean_distance)
# The same thing can be written as a decorator
@nb.jit
def mean_distance(x,y):
    nx = len(x)
    result = 0.0
    count = 0
    for i in range(nx):
        result += x[i] - y[i]
        count += 1
    return result/count
%timeit numba_mean_distance(x,y)
670 µs ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numba.njit is shorthand for jit(nopython=True); passing an explicit signature compiles the function eagerly for float64 arrays
from numba import float64,njit
@njit(float64(float64[:],float64[:]))
def mean_distance(x,y):
    return (x-y).mean()

A.7.1 Creating Custom numpy.ufunc Objects with Numba

  • The numba.vectorize function creates compiled NumPy ufuncs, which behave like built-in ufuncs.
from numba import vectorize
@vectorize
def nb_add(x,y):
    return x+y
x = np.arange(10)
nb_add(x,x)
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18], dtype=int64)

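A.8 Advanced Array Input and Output

A.8.1 Memory-Mapped Files

  • A memory-mapped file is a way of interacting with binary data on disk as if it were stored in an in-memory array
  • NumPy implements a memmap object that is ndarray-like, enabling small segments of a large file to be read and written without reading the whole array into memory.
  • In addition, a memmap has the same methods as an in-memory array and can therefore be substituted into many algorithms where an ndarray would be expected.
# To create a new memory map, pass a file path, dtype, shape and file mode
mmap = np.memmap('mymmap',dtype='float64',mode='w+',shape=(10000,10000))
mmap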
memmap([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
# Slicing a memmap returns views on the data on disk
section = mmap[:5]
section
memmap([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])
# If you assign data to these slices, it is buffered in memory (like a Python file object); call flush to write it to disk
section[:] = np.random.randn(5,10000)
mmap.flush()
mmap
memmap([[ 0.41110843,  0.58204806,  1.2463012 , ...,  0.06582078,
         -0.34734378,  0.62280733],
        [-2.21583571,  0.29678775,  0.57086919, ...,  0.07007184,
         -0.26204433, -0.30061136],
        [ 0.77817885,  0.74008809,  0.49653126, ..., -0.51072764,
          1.11806763,  0.09285284],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])
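# Whenever a memory map falls out of scope and is garbage collected, any changes are flushed to disk as well
del mmap
# When opening an existing memory map, you still have to specify the dtype and shape, because the file is only a block of binary data without any metadata
mmap = np.memmap('mymmap',dtype='float64',shape=(10000,10000))
mmap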
memmap([[ 0.41110843,  0.58204806,  1.2463012 , ...,  0.06582078,
         -0.34734378,  0.62280733],
        [-2.21583571,  0.29678775,  0.57086919, ...,  0.07007184,
         -0.26204433, -0.30061136],
        [ 0.77817885,  0.74008809,  0.49653126, ..., -0.51072764,
          1.11806763,  0.09285284],
        ...,
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        , ...,  0.        ,
          0.        ,  0.        ]])
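The create, write, flush and reopen cycle can be tried on a much smaller file. This is a minimal sketch: the temporary directory, the file name `demo.dat` and the shape (100, 8) are assumptions chosen to keep the example small, not values from the notes.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch location; the temporary directory is left for the OS to clean up.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.dat")

# mode='w+' creates (or overwrites) the file and fills it with zeros.
mm = np.memmap(path, dtype="float64", mode="w+", shape=(100, 8))
block = np.arange(16, dtype="float64").reshape(2, 8)
mm[:2] = block      # write through a slice of the map
mm.flush()          # push buffered changes to disk
del mm

# The file carries no metadata, so dtype and shape must be given again.
reopened = np.memmap(path, dtype="float64", mode="r", shape=(100, 8))
first_rows = np.array(reopened[:2])   # copy the slice into ordinary memory
file_size = os.path.getsize(path)

print(first_rows)
print(file_size)
```

Opening with mode='r' makes the map read-only, which is a reasonable default when the data only needs to be inspected.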

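A.8.2 HDF5 and Other Array Storage Options

  • PyTables and h5py are two Python projects providing NumPy-friendly interfaces for storing array data in the efficient and compressible HDF5 format (HDF stands for Hierarchical Data Format)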

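A.9 Performance Tips

  • Things to keep in mind:
    • Convert Python loops and conditional logic to array operations and boolean array operations
    • Use broadcasting whenever possible
    • Use array views (slicing) to avoid copying data
    • Use ufuncs and ufunc methods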
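The first two tips can be illustrated with a small before-and-after comparison. This is a sketch on made-up data: `clip_loop` is a hypothetical helper written only to show the loop being replaced, and `values` and `table` are illustrative arrays.

```python
import numpy as np

values = np.array([-2.0, 0.5, 3.0, -1.5, 4.0])

def clip_loop(arr, floor):
    """Loop-and-branch version: replace everything below floor with floor."""
    out = []
    for v in arr:
        if v < floor:
            out.append(floor)
        else:
            out.append(v)
    return np.array(out)

clipped_loop = clip_loop(values, 0.0)

# Array version: a boolean mask and np.where replace the loop and the if/else.
clipped_vec = np.where(values < 0.0, 0.0, values)

# Broadcasting: subtract each column's mean from a 2D array without a loop.
table = np.arange(6.0).reshape(2, 3)
centered = table - table.mean(axis=0)

print(clipped_vec)
print(centered)
```

The vectorized forms do the same work inside compiled NumPy routines, which is where most of the speedup comes from on large arrays.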

A.9.1 The Importance of Contiguous Memory

arr_c = np.ones((1000,1000),order='C')
arr_f = np.ones((1000,1000),order='F')
arr_c.flags
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
arr_f.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
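# In theory, arr_c should be faster to sum over its rows than arr_f, because the rows of arr_c are contiguous in memory.
# In this run the two timings came out nearly identical.
%timeit arr_c.sum(1)
241 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit arr_f.sum(1)
238 µs ± 589 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# When an array does not have the desired memory order, use copy and pass either 'C' or 'F'
arr_f.copy('C').flags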
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
# When constructing a view on an array, keep in mind that the result is not guaranteed to be contiguous
arr_c[:50].flags.contiguous
True
arr_c[:,:50].flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
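Whether a view is contiguous can be checked programmatically, and np.ascontiguousarray can be used to obtain a contiguous copy when one is needed. The sketch below uses a smaller array than the notes; `row_block`, `col_block` and `fixed` are illustrative names.

```python
import numpy as np

arr_c = np.ones((100, 100), order='C')

# Taking whole leading rows of a C-ordered array keeps the memory contiguous.
row_block = arr_c[:50]
# Taking a subset of columns leaves gaps between rows, so the view is not contiguous.
col_block = arr_c[:, :50]

# ascontiguousarray copies only when the input is not already C-contiguous.
fixed = np.ascontiguousarray(col_block)

print(row_block.flags.c_contiguous)
print(col_block.flags.c_contiguous, col_block.flags.f_contiguous)
print(fixed.flags.c_contiguous)
```

The copy costs memory and time up front, so it is mainly worthwhile when the array is then processed many times.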

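Appendix B More on the IPython System

  • IPython maintains a small on-disk database containing the text of each command that you execute. This serves several purposes:
    • Searching, completing, and executing previously executed commands with minimal typing
    • Persisting the command history between sessions
    • Logging the input/output history to a file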

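B.1.1 Searching and Reusing the Command History

B.1.2 Input and Output Variables

  • The previous two outputs are stored in the _ (one underscore) and __ (two underscores) variables
2**27
134217728
_
134217728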
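  • Input variables are stored in variables named _iX, where X is the input line number. For each input variable there is a corresponding output variable _X.
foo = 'bar'
foo
'bar'
_i137
"foo = 'bar'\nfoo"
_137
'bar'
# Since the input variables are strings, they can be executed again with the Python exec function
exec(_i27)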
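  • %hist prints all or part of the input history, with or without line numbers.
  • %reset clears the interactive namespace and, optionally, the input and output caches.
  • The %xdel magic function removes all references to a particular object from the IPython machinery.
%hist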

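B.2 Interacting with the Operating System

Command                  Description
!cmd                     Execute cmd in the system shell
output = !cmd args       Run cmd and store the stdout in output
%alias alias_name cmd    Define an alias for a system (shell) command
%bookmark                Use IPython's directory bookmarking system
%cd directory            Change the system working directory to the passed directory
%pwd                     Return the current system working directory
%pushd directory         Place the current directory on the stack and change to the target directory
%popd                    Change to the directory popped off the top of the stack
%dirs                    Return a list containing the current directory stack
%dhist                   Print the history of visited directories
%env                     Return the system environment variables as a dict
%matplotlib              Configure matplotlib integration options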

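B.2.1 Shell Commands and Aliases

  • By assigning a !-escaped expression to a variable, you can store the console output of a shell command in that variable.
  • The %alias magic function can define custom shortcuts for shell commands
  • You will notice that IPython "forgets" any aliases you define interactively as soon as the session is closed. To create permanent aliases, you need to use the configuration system.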

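B.2.2 Directory Bookmark System

  • %bookmark saves an alias for a directory so that you can jump to it easily with %cd
  • Running %bookmark with the -l option lists all of your bookmarks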

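B.3 Software Development Tools

B.3.1 Interactive Debugger

  • Python debugger commands
Command                     Action
h(elp)                      Display the command list
help command                Show documentation for command
c(ontinue)                  Resume program execution
q(uit)                      Exit the debugger without executing any more code
b(reak) number              Set a breakpoint at line number in the current file
b path/to/file.py:number    Set a breakpoint at line number in the specified file
s(tep)                      Step into a function call
n(ext)                      Execute the current line and advance to the next line at the current level
u(p)/d(own)                 Move up/down in the function call stack
a(rgs)                      Show the arguments of the current function
debug statement             Invoke statement in a new (recursive) debugger
l(ist) statement            Show the current position and context at the current level of the stack
w(here)                     Print the full stack trace with context at the current position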
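B.3.1.1 Other Ways to Use the Debugger
  • The first approach is a set_trace helper function (named after pdb.set_trace), which is very simple: call it anywhere in your code to pause execution temporarily and examine the state more closely
  • Pressing c (continue) resumes normal execution, with no harm done.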

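B.3.2 Timing Code: %time and %timeit

  • %time runs a statement once and reports the total execution time.
  • Given an arbitrary statement, %timeit runs it multiple times to produce a more accurate average runtime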
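Outside IPython, a similar repeated measurement can be approximated with the standard library's timeit module. This is a rough analogue rather than what %timeit does internally; the array size and the repeat and number values are arbitrary choices for the sketch.

```python
import timeit

import numpy as np

arr = np.random.default_rng(0).standard_normal(10_000)

# Run the statement 100 times per measurement and take 5 measurements.
number = 100
runs = timeit.repeat(lambda: np.add(arr, arr), repeat=5, number=number)

# Each entry is the total time for `number` calls; the minimum is the least noisy estimate.
best_per_call = min(runs) / number

print(f"best of {len(runs)}: {best_per_call * 1e6:.2f} µs per call")
```

Repeating the measurement and looking at the best or average run reduces the influence of one-off delays, which is the same motivation behind %timeit.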

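B.3.3 Basic Profiling: %prun and %run -p

B.3.4 Profiling a Function Line by Line

B.4 Tips for Productive Code Development Using IPython

B.4.1 Reloading Module Dependencies

B.4.2 Code Design Tips

B.4.2.1 Keep relevant objects and data alive
B.4.2.2 Flat is better than nested
B.4.2.3 Overcome a fear of longer files

B.5 Advanced IPython Features

B.5.1 Making Your Own Classes IPython-Friendly

B.5.2 Profiles and Configuration

  • All of the following can be done through configuration:
    • Change the color scheme
    • Change how the input and output prompts look, or remove the blank line after Out and before the next In prompt
    • Execute an arbitrary list of Python statements (e.g., imports that you use all the time, or anything else you want to happen each time you launch IPython)
    • Enable always-on IPython extensions, such as the %lprun magic in line_profiler
    • Enable Jupyter extensions
    • Define your own magics or system aliases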