
Introduction to Data Preprocessing

Data preprocessing concepts

ULTIMATE DATA SCIENCE GUIDE

Data is a collection of facts and figures, observations, or descriptions of things, in an unorganized or organized form. Data can exist as images, words, numbers, characters, video, audio, and so on.

What is data preprocessing

To analyze our data and extract insights from it, we need to process the data before we start building our machine learning model; that is, we need to convert the data into a form our model can understand, since machines cannot interpret data in the form of raw images, audio, and so on.

Data is processed into an efficient format that can be easily interpreted by the algorithm, so that it produces the required output accurately.

The data we use in the real world is not perfect: it is incomplete, inconsistent (with outliers and noisy values), and unstructured. Preprocessing the raw data helps to organize, scale, clean (remove outliers), and standardize it, simplifying it before feeding it to a machine learning algorithm.

The process of data preprocessing involves a few steps:

  • Data cleaning: the data we use may have missing points (rows or columns that do not contain any values) or noisy data (irrelevant data that is difficult for the machine to interpret). To solve these problems we can delete the empty rows and columns or fill them with other values (see the sketch after this list), and we can use methods like regression and clustering for noisy data.

  • Data transformation: this is the process of transforming the raw data into a format suitable for the model. It may include steps like categorical encoding, scaling, normalization, standardization, etc.

  • Data reduction: this helps to reduce the size of the data we are working on (for easy analysis) while maintaining the integrity of the original data.
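
As a concrete sketch of the cleaning step, here is a minimal example using pandas; the DataFrame and its column names are made up for illustration:

import pandas as pd
import numpy as np

# hypothetical data with a missing value in the 'age' column
df = pd.DataFrame({'age': [25, np.nan, 31], 'salary': [50000, 62000, 58000]})

df_dropped = df.dropna()          # option 1: delete rows containing missing values
df_filled = df.fillna(df.mean())  # option 2: fill missing values with column means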

Scikit-learn library for data preprocessing

Scikit-learn is a popular open-source machine learning library. It provides various essential tools, including algorithms for random forests, classification, and regression, and of course for data preprocessing as well. The library is built on top of NumPy and SciPy and is easy to learn and understand.

We can use the following code to import the library into the workspace:

import sklearn

To include the preprocessing features, we can use the following code:

from sklearn import preprocessing

In this article, we will be focusing on some essential data preprocessing features: standardization, normalization, categorical encoding, discretization, imputation of missing values, generating polynomial features, and custom transformers.

So, now let's get started with these functions!

Standardization

Standardization is a technique used to scale the data such that its mean becomes zero and its standard deviation becomes one. The values are not restricted to a particular range. We can use standardization when the features of the input data set have large differences between their ranges.

(Image by Author) The formula for standardization of data: x_scaled = (x − μ) / σ, where μ is the mean and σ is the standard deviation of the feature.

Let us consider the following example:

from sklearn import preprocessing
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y_scaled = preprocessing.scale(x)  # zero mean, unit variance per column
print(y_scaled)

Here we have an input array of dimension 3x3, with values ranging from one to nine. Using the scale function available in the preprocessing module, we can quickly scale our data.

(Image by Author) Scaled data

There is another utility available in this library, StandardScaler, which helps us compute the mean and standard deviation on a training set and later reapply the same transformation to the test set, by implementing the Transformer API.
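
A minimal sketch of this fit-then-transform pattern; the training and test arrays here are made up for illustration:

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1., 2.], [3., 4.], [5., 6.]])
X_test = np.array([[3., 4.]])

scaler = StandardScaler().fit(X_train)  # learn mean and standard deviation from the training data
scaler.transform(X_test)                # reuse the same transformation on new data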

If we want to scale our features to a given range we can use MinMaxScaler (using the parameter feature_range=(min, max)) or MaxAbsScaler (the difference is that in MaxAbsScaler the maximum absolute value of each feature is scaled to unit size).

from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler(feature_range=(0, 8))
y = np.array([[1, 2, 3],
              [4, -5, -6],
              [7, 8, 9]])
scale = scaler.fit_transform(y)
scale

Here the values of a 3x3 array are scaled to the given range (0, 8), and we have used the .fit_transform() function, which will let us apply the same transformation to another dataset later.

(Image by Author) Scaled data in a specified range
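
For comparison, a minimal MaxAbsScaler sketch on the same made-up array:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np

y = np.array([[1, 2, 3],
              [4, -5, -6],
              [7, 8, 9]])
MaxAbsScaler().fit_transform(y)  # each column is divided by its maximum absolute value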

Normalization

Normalization is the process of scaling values into the range (-1, 1), i.e. converting them to a common scale. This ensures that large values in the data set do not dominate the learning process, so that all features have a similar impact on the model. Normalization can be used when we want to quantify the similarity of any pair of samples, for example with a dot product.

from sklearn import preprocessing
import numpy as np

X = [[1, 2, 3],
     [4, -5, -6],
     [7, 8, 9]]
y = preprocessing.normalize(X)
y

(Image by Author) Normalized data

This module also provides an alternative that implements the Transformer API: the Normalizer class, which performs the same operation.
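
A minimal sketch of the Normalizer class, reusing the array from the example above:

from sklearn.preprocessing import Normalizer

X = [[1, 2, 3], [4, -5, -6], [7, 8, 9]]
normalizer = Normalizer()  # scales each sample to unit (L2) norm by default
normalizer.fit(X)          # fit is stateless here; it simply returns the estimator
normalizer.transform(X)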

Encoding categorical features

Often the data we use may not have feature values in a continuous form, but instead categories with text labels. To get this data processed by a machine learning model, it is necessary to convert these categorical features into a machine-understandable form.

There are two encoders available in this module through which we can encode our categorical features:

  • OrdinalEncoder: this converts categorical features to integer values, such that each categorical feature is transformed into one new feature of integers (0 to n_categories - 1).

from sklearn import preprocessing

enc = preprocessing.OrdinalEncoder()
X = [['a', 'b', 'c', 'd'],
     ['e', 'f', 'g', 'h'],
     ['i', 'j', 'k', 'l']]
enc.fit(X)
enc.transform([['a', 'f', 'g', 'l']])

Here, the three categories of each feature are encoded as 0, 1, 2, and the output for the above input is:

(Image by Author) Encoded data

  • OneHotEncoder: this encoder transforms each categorical feature with n_categories possible values into n_categories binary features, one of them 1 and all others 0. Check the following example for a better understanding.

from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
X = [['a', 'b', 'c', 'd'],
     ['e', 'f', 'g', 'h'],
     ['i', 'j', 'k', 'l']]
enc.fit(X)
enc.transform([['a', 'f', 'g', 'l']]).toarray().reshape(4, 3)

(Image by Author) Encoded data

Discretization

The process of discretization helps us partition the continuous features of the data into discrete values (also known as binning or quantization). This is similar to creating a histogram from continuous data (except that discretization focuses on assigning feature values to these bins). Discretization can help us introduce non-linearity into linear models in some cases.

from sklearn import preprocessing
import numpy as np

X = np.array([[1, 2, 3],
              [-4, -5, 6],
              [7, 8, 9]])
dis = preprocessing.KBinsDiscretizer(n_bins=[3, 2, 2], encode='ordinal')
dis.fit_transform(X)

The KBinsDiscretizer() function discretizes the features into k bins. By default the output is one-hot encoded, which we can change with the encode parameter (here we use 'ordinal').

(Image by Author) Data discretization

Imputation of missing values

This process handles missing values in the data (NaNs, blanks, etcetera) by assigning them a value (imputing, based on the known part of the dataset) so that the data can be processed by the model. Let's understand this with an example:

from sklearn.impute import SimpleImputer
import numpy as np

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
X = [[np.nan, 1, 2],
     [3, 4, np.nan],
     [5, np.nan, 6]]
impute.fit_transform(X)

Here, we have used the SimpleImputer() function for imputing the missing values. The parameters used in this function are missing_values, to specify which values are treated as missing, and strategy, to specify how we want to impute them. In the above example we used mean, which means the missing values are replaced by the mean of the column values. Other options for strategy are median, most_frequent (based on the frequency of occurrence of a particular value in a column), and constant (a constant value).

(Image by Author) Imputing missing values
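
As a quick illustration of another strategy, here is a sketch using most_frequent; the array values are made up for this example:

from sklearn.impute import SimpleImputer
import numpy as np

impute = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X = [[np.nan, 1], [2, 1], [2, np.nan]]
impute.fit_transform(X)  # each NaN is replaced by its column's most frequent value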

Generating polynomial features

To get greater accuracy from our machine learning model, it is sometimes good to introduce complexity into the model (by adding non-linearity). We can implement this simply by using the PolynomialFeatures() function.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1, 2],
              [3, 4]])
nonl = PolynomialFeatures(2)
nonl.fit_transform(x)

(Image by Author) Generating polynomial features

In the example above, we set the degree of the required non-linear model to 2 in the PolynomialFeatures() function. The feature values of the input array are transformed from (X1, X2) to (1, X1, X2, X1², X1·X2, X2²).
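
If only the interaction terms are needed, PolynomialFeatures also accepts an interaction_only parameter; a minimal sketch on the same array:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1, 2],
              [3, 4]])
PolynomialFeatures(degree=2, interaction_only=True).fit_transform(x)
# features become (1, X1, X2, X1*X2), dropping the pure powers X1² and X2²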

Custom transformers

If we need to transform the entire dataset with a particular Python function, for a purpose such as data processing or cleaning, we can create a custom transformer by using FunctionTransformer() and passing the required function to it.

from sklearn import preprocessing
import numpy as np

transformer = preprocessing.FunctionTransformer(np.log1p, validate=True)
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
transformer.transform(X)

In this example, we have used NumPy's log1p function to transform our dataset values.

(Image by Author) Implementing custom transformers
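
The same approach works with an ordinary user-defined function; the helper below and its clipping range are made up for illustration:

from sklearn.preprocessing import FunctionTransformer
import numpy as np

def clip_to_range(a):
    # hypothetical helper: clip all values into the range [0, 5]
    return np.clip(a, 0, 5)

transformer = FunctionTransformer(clip_to_range, validate=True)
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
transformer.transform(X)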

Conclusion

I hope this article has helped you understand the concepts of and need for data preprocessing in machine learning models, and that you will be able to apply them to real data sets.

For a better understanding of these concepts, I recommend you try implementing them on your own. Keep exploring, and I am sure you will discover new features along the way.

If you have any questions or comments, please post them in the comment section.

Check out the complete data visualization guide and the essential functions of NumPy.

Originally published at: www.patataeater.blogspot.com

Translated from: https://towardsdatascience.com/introduction-to-data-preprocessing-67a67c42a036
