When a model compares samples by distance, some features take values over a wide range while others take values over a narrow one.
For example, feature A (room area) might be 70, 100, 120, while feature B (number of rooms) is 3, 4, 5. The ranges of A and B differ greatly, so computing distances on the raw values lets A dominate, which is unreasonable; the data should first be standardized or normalized.
Normalization: map all data onto a proportional space (a single common scale).
On units of measurement: continuous variables are preprocessed with standardization, while unordered categorical variables need to be expanded into dummy variables.
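As a minimal sketch of the dummy-variable step (the DataFrame and column names here are hypothetical; pandas.get_dummies is one common way to do it):

import pandas as pd

# Hypothetical data: one continuous column and one unordered categorical column
df = pd.DataFrame({'area': [70, 100, 120],
                   'color': ['red', 'green', 'red']})

# The continuous column gets standardized later; the categorical column is
# replaced by one dummy (one-hot) column per category: color_green, color_red
df = pd.get_dummies(df, columns=['color'])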
Min-Max scaling: the commonly mentioned 0-1 normalization.
English: Min-Max scaling, also called normalization.
It is the most widely used normalization method.
Method: compress all values into the interval [0, 1], so that features with larger raw scales do not distort the result.
x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}}
Suitable for: distributions with clear boundaries, such as exam scores (0-100) or color channel values (0-255).
Drawback: strongly affected by outliers, e.g. incomes, which are spread over a very wide range.
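A minimal sklearn sketch of Min-Max scaling on the area/rooms example (a manual NumPy version follows further below):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[70., 3.], [100., 4.], [120., 5.]])  # columns: area, rooms

scaler = MinMaxScaler()            # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)
print(X_scaled)
# [[0.  0. ]
#  [0.6 0.5]
#  [1.  1. ]]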
Mean-variance scaling, also called standardization.
English: standardization or Z-score normalization.
Method: rescale all data to a distribution with mean 0 and variance 1.
Formula:

x_{scale} = \frac{x - x_{mean}}{s}

which can also be written as

z = \frac{x - \mu}{\sigma}
The scaled data have mean \mu = 0 and standard deviation \sigma = 1;
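As a quick numeric check of the formula: with \mu = 50 and \sigma = 10, a raw value x = 70 maps to z = \frac{70 - 50}{10} = 2, i.e. two standard deviations above the mean.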
The resulting data are symmetric about the origin, and every feature ends up in roughly the same range.
Suitable for: data without clear boundaries, where extreme values may occur.
Should the training data and the test data each be normalized on their own?
The correct approach: normalize the test set with the mean_train and std_train obtained from the training set:
(x_test - mean_train) / std_train
The reason: the test set stands in for unseen real-world data, and in production only the statistics learned during training are available; scaling the test set with its own statistics would also make the train and test features inconsistent. Normalization is part of the model and must be learned from the training data alone.
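A minimal sketch of this rule, assuming X_train and X_test are NumPy feature arrays (the names are illustrative):

import numpy as np

mean_train = X_train.mean(axis=0)  # statistics come from the training set only
std_train = X_train.std(axis=0)

X_train_scaled = (X_train - mean_train) / std_train
X_test_scaled = (X_test - mean_train) / std_train  # reuse the training statistics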
Standardization in sklearn: StandardScaler
https://scikit-learn.org/0.20/modules/generated/sklearn.preprocessing.StandardScaler.html
import numpy as np

x = np.random.randint(0, 100, size=100)
x
'''
array([29, 45, 97, 97,  5, 56, 36, 74, 50, 26, 19, 48, 36, 36, 23,  3, 63,
       10, 85, 74, 89, 24, 93, 11, 78, 79,  4, 75, 10, 86, 85, 39, 28, 24,
       ...
       92, 70,  3, 56])
'''

(x - np.min(x)) / (np.max(x) - np.min(x))
'''
array([0.28125   , 0.44791667, 0.98958333, 0.98958333, 0.03125   ,
       0.5625    , 0.35416667, 0.75      , 0.5       , 0.25      ,
       ...
       0.9375    , 0.70833333, 0.01041667, 0.5625    ])
'''
X = np.random.randint(0, 100, (50, 2))
X[:10, :]
'''
array([[23, 47],
       [18,  6],
       [11,  0],
       [92, 76],
       [78, 57],
       [66,  7],
       [52, 78],
       [97, 83],
       [58, 48],
       [65, 88]])
'''

# Convert X to float so the scaled values are not truncated to integers
X = np.array(X, dtype=float)

# Min-Max scale the first feature (column 0)
X1 = X[:, 0]
X11 = (X1 - np.min(X1)) / (np.max(X1) - np.min(X1))
X11
'''
array([0.23232323, 0.18181818, 0.11111111, 0.92929293, 0.78787879,
       0.66666667, 0.52525253, 0.97979798, 0.58585859, 0.65656566,
       ...
       0.96969697, 0.02020202, 0.5959596 , 0.66666667])
'''

# Min-Max scale the second feature (column 1)
X2 = X[:, 1]
X21 = (X2 - np.min(X2)) / (np.max(X2) - np.min(X2))
X21
'''
array([0.47474747, 0.06060606, 0.        , 0.76767677, 0.57575758,
       0.07070707, 0.78787879, 0.83838384, 0.48484848, 0.88888889,
       ...
       0.09090909, 0.41414141, 0.08080808, 0.61616162])
'''

# Check the mean and standard deviation of each scaled column
print('mean 1:', np.mean(X11))
print('std 1:', np.std(X11))
print('mean 2:', np.mean(X21))
print('std 2:', np.std(X21))
'''
mean 1: 0.5335353535353535
std 1: 0.3290554389302859
mean 2: 0.4656565656565656
std 2: 0.31417389159473386
'''

import matplotlib.pyplot as plt
plt.scatter(X11, X21)  # all values now lie between 0 and 1
# Mean-variance scaling (standardization) of the same two columns
X12 = (X1 - np.mean(X1)) / np.std(X1)
X22 = (X2 - np.mean(X2)) / np.std(X2)

print('mean 1:', np.mean(X12))
print('std 1:', np.std(X12))
print('mean 2:', np.mean(X22))
print('std 2:', np.std(X22))
# the mean is (numerically) 0 and the standard deviation is 1
'''
mean 1: -8.881784197001253e-18
std 1: 1.0
mean 2: -4.5519144009631415e-17
std 2: 0.9999999999999999
'''

plt.scatter(X12, X22)
sklearn handles normalization through Scaler objects, wrapped so that their interface looks as much as possible like its other algorithms (fit, then transform):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Mean-variance scaling (standardization)
std_scaler = StandardScaler()
std_scaler.fit(X_train)
# StandardScaler(copy=True, with_mean=True, with_std=True)

# Per-feature means learned from the training set (the four iris features)
std_scaler.mean_
# array([5.80916667, 3.06166667, 3.72666667, 1.18333333])

# Per-feature scales; earlier versions called this std_, later renamed to
# scale_, meaning the spread of the data distribution
std_scaler.scale_
# array([0.82036535, 0.44724776, 1.74502786, 0.74914766])

X_train_std = std_scaler.transform(X_train)
X_train_std
'''
array([[-1.47393679,  1.20365799, -1.56253475, -1.31260282],
       [-0.13307079,  2.99237573, -1.27600637, -1.04563275],
       [ 1.08589829,  0.08570939,  0.38585821,  0.28921757],
       ...
       [-0.01117388, -1.0322392 ,  0.15663551,  0.02224751],
       [ 1.57348593, -0.13788033,  1.24544335,  1.22361279]])
'''

X_test_std = std_scaler.transform(X_test)
X_test_std
'''
array([[ 0.35451684, -0.58505976,  0.55777524,  0.02224751],
       [-0.13307079,  1.65083742, -1.16139502, -1.17911778],
       ...
       [-1.23014297, -0.13788033, -1.33331205, -1.17911778],
       [-1.23014297,  0.08570939, -1.2187007 , -1.31260282]])
'''
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_std, y_train)
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=3, p=2, weights='uniform')
knn_clf.score(X_test_std, y_test)  # X_train above was standardized, so X_test must be transformed the same way
# 1.0
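The fit-on-train, transform-both discipline can also be packaged with sklearn's Pipeline, which guarantees the scaler only ever learns its statistics from the training data (a sketch reusing the iris split above):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
pipe.fit(X_train, y_train)   # the scaler learns mean_/scale_ from X_train only
pipe.score(X_test, y_test)   # X_test is transformed with the training statistics
# same score as the manual version above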
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# df is assumed to be a DataFrame with 'alcohol' and 'acid' columns
std_scaler = StandardScaler().fit(df[['alcohol', 'acid']])
df_std = std_scaler.transform(df[['alcohol', 'acid']])

minmax_scaler = MinMaxScaler().fit(df[['alcohol', 'acid']])
df_minmax = minmax_scaler.transform(df[['alcohol', 'acid']])
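A quick check of the two results, using hypothetical toy data standing in for the wine columns above:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical values for illustration only
df = pd.DataFrame({'alcohol': [12.8, 13.5, 11.2, 14.0],
                   'acid': [2.1, 1.8, 3.0, 2.4]})

df_std = StandardScaler().fit_transform(df[['alcohol', 'acid']])
df_minmax = MinMaxScaler().fit_transform(df[['alcohol', 'acid']])

print(df_std.mean(axis=0), df_std.std(axis=0))       # roughly [0, 0] and [1, 1]
print(df_minmax.min(axis=0), df_minmax.max(axis=0))  # [0, 0] and [1, 1]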
import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Learn the per-feature mean and standard deviation from the training set X."""
        assert X.ndim == 2, "The dimension of X must be 2"
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        """Apply mean-variance scaling to X using the statistics learned in fit."""
        # Only 2-D data is handled
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
            "the feature number of X must be equal to mean_ and std_"
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX
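A usage sketch for the hand-rolled class above (random data; the result should match sklearn's scaler up to floating-point error):

import numpy as np

X = np.random.randint(0, 100, (50, 2)).astype(float)

scaler = StandardScaler()  # the from-scratch class defined above
scaler.fit(X)
X_scaled = scaler.transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # [1, 1]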