We have already studied supervised as well as unsupervised machine learning algorithms. These algorithms require formatted data to start the training process. We must prepare or format data in a certain way so that it can be supplied as an input to ML algorithms.
This chapter focuses on data preparation for machine learning algorithms.
In our daily life, we deal with lots of data, but this data is in raw form. To provide it as input to machine learning algorithms, we need to convert it into meaningful data. That is where data preprocessing comes into the picture. In simple words, before providing the data to machine learning algorithms, we need to preprocess it.
Follow these steps to preprocess the data in Python −
Step 1 − Importing the useful packages − If we are using Python then this would be the first step for converting the data into a certain format, i.e., preprocessing. It can be done as follows −
-
- import numpy as np
- from sklearn import preprocessing
Here we have used the following two packages −
NumPy − Basically NumPy is a general purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
sklearn.preprocessing − This package provides many common utility functions and transformer classes that change raw feature vectors into a representation more suitable for machine learning algorithms.
Step 2 − Defining sample data − After importing the packages, we need to define some sample data so that we can apply preprocessing techniques on that data. We will now define the following sample data −
-
- input_data = np.array([[2.1, -1.9, 5.5],
-                        [-1.5, 2.4, 3.5],
-                        [0.5, -7.9, 5.6],
-                        [5.9, 2.3, -5.8]])
Step 3 − Applying a preprocessing technique − In this step, we apply one of the preprocessing techniques to the sample data.
The techniques for data preprocessing are described below −
Binarization is the preprocessing technique used when we need to convert numerical values into Boolean values. We can use a built-in method to binarize the input data, for example with 0.5 as the threshold value, in the following way −
-
- data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
- print("\nBinarized data:\n", data_binarized)
After running the above code, we will get the following output. All values above 0.5 (the threshold value) are converted to 1, and all values less than or equal to 0.5 are converted to 0.
Binarized data
-
- [[ 1. 0. 1.]
- [ 0. 1. 1.]
- [ 0. 0. 1.]
- [ 1. 1. 0.]]
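The thresholding rule described above can also be sketched directly in NumPy. This is only an illustration of the rule, not how the library implements it; the name manual_binarized is an assumption made for this example.

```python
import numpy as np

# Same sample data as defined in Step 2.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

threshold = 0.5

# Values strictly greater than the threshold become 1.0, all others 0.0.
# Note that the entry equal to 0.5 therefore maps to 0.0.
manual_binarized = np.where(input_data > threshold, 1.0, 0.0)
print(manual_binarized)
```

The result should match the binarized data shown above, including the 0 in the first column of the third row, where the value is exactly 0.5.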
Mean removal is another very common preprocessing technique in machine learning. It subtracts the mean from each feature so that every feature is centered on zero, which removes the bias from the features in the feature vector. Before applying mean removal to the sample data, we can display the mean and standard deviation of each column of the input data with the Python code shown below −
-
- print("Mean = ", input_data.mean(axis = 0))
- print("Std deviation = ", input_data.std(axis = 0))
We will get the following output after running the above lines of code −
-
- Mean = [ 1.75 -1.275 2.2]
- Std deviation = [ 2.71431391 4.20022321 4.69414529]
Now, the code below standardizes the input data: it subtracts the mean of each feature and divides by its standard deviation, so that every feature ends up with zero mean and unit standard deviation −
-
- data_scaled = preprocessing.scale(input_data)
- print("Mean =", data_scaled.mean(axis=0))
- print("Std deviation =", data_scaled.std(axis = 0))
We will get the following output after running the above lines of code −
-
- Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
- Std deviation = [ 1. 1. 1.]
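The same standardization can be written out by hand, which makes the arithmetic explicit. This is a minimal sketch using the population standard deviation (NumPy's default, ddof=0); the name manual_scaled is an assumption made for this example.

```python
import numpy as np

# Same sample data as defined in Step 2.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-column (per-feature) statistics.
column_mean = input_data.mean(axis=0)
column_std = input_data.std(axis=0)  # population standard deviation (ddof=0)

# Subtract the mean, then divide by the standard deviation.
manual_scaled = (input_data - column_mean) / column_std

print("Mean =", manual_scaled.mean(axis=0))
print("Std deviation =", manual_scaled.std(axis=0))
```

The printed means are zero only up to floating-point rounding, which is why a value such as 1.11022302e-16 can appear in place of an exact 0.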
Scaling is another data preprocessing technique, used to scale the feature vectors. It is needed because the values of different features can span very different ranges. In other words, scaling is important because we do not want any feature to be artificially large or small. With the following Python code, we can scale our input data, i.e., the feature vector, so that each feature lies between 0 and 1 −
-
- # Min max scaling
- data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0,1))
- data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
- print ("\nMin max scaled data:\n", data_scaled_minmax)
We will get the following output after running the above lines of code −
Min max scaled data
-
- [[ 0.48648649 0.58252427 0.99122807]
- [ 0. 1. 0.81578947]
- [ 0.27027027 0. 1. ]
- [ 1. 0.99029126 0. ]]
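For a feature range of (0, 1), min-max scaling amounts to subtracting each column's minimum and dividing by that column's range. The sketch below reproduces this by hand in NumPy; the name manual_minmax is an assumption made for this example.

```python
import numpy as np

# Same sample data as defined in Step 2.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-column minimum and maximum.
column_min = input_data.min(axis=0)
column_max = input_data.max(axis=0)

# Map each column linearly so its minimum becomes 0 and its maximum becomes 1.
manual_minmax = (input_data - column_min) / (column_max - column_min)
print(manual_minmax)
```

Each column of the result contains exactly one 0 and one 1, corresponding to the smallest and largest value of that feature.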
Normalization is another data preprocessing technique, used to modify the feature vectors so that they can be measured on a common scale. The following two types of normalization can be used in machine learning −
L1 Normalization
It is also referred to as Least Absolute Deviations. This kind of normalization modifies the values so that the sum of the absolute values in each row is always 1. It can be implemented on the input data with the help of the following Python code −
-
- # Normalize data
- data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
- print("\nL1 normalized data:\n", data_normalized_l1)
The above lines of code generate the following output −
-
- L1 normalized data:
- [[ 0.22105263 -0.2 0.57894737]
- [ -0.2027027 0.32432432 0.47297297]
- [ 0.03571429 -0.56428571 0.4 ]
- [ 0.42142857 0.16428571 -0.41428571]]
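To see where these numbers come from, L1 normalization can be sketched by hand: each row is divided by the sum of the absolute values of that row. The name manual_l1 is an assumption made for this example.

```python
import numpy as np

# Same sample data as defined in Step 2.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L1 norm of each row: sum of absolute values, kept as a column for broadcasting.
row_l1_norm = np.abs(input_data).sum(axis=1, keepdims=True)

# Divide every row by its own L1 norm.
manual_l1 = input_data / row_l1_norm

print(manual_l1)
print("Row sums of absolute values:", np.abs(manual_l1).sum(axis=1))
```

For the first row, the L1 norm is 2.1 + 1.9 + 5.5 = 9.5, so the first entry becomes 2.1 / 9.5, which is about 0.221.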
L2 Normalization
It is also referred to as least squares. This kind of normalization modifies the values so that the sum of the squares in each row is always 1. It can be implemented on the input data with the help of the following Python code −
-
- # Normalize data
- data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
- print("\nL2 normalized data:\n", data_normalized_l2)
The above line of code will generate the following output −
-
- L2 normalized data:
- [[ 0.33946114 -0.30713151 0.88906489]
- [ -0.33325106 0.53320169 0.7775858 ]
- [ 0.05156558 -0.81473612 0.57753446]
- [ 0.68706914 0.26784051 -0.6754239 ]]
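L2 normalization follows the same pattern, with each row divided by its Euclidean length (the square root of the sum of squares). A minimal sketch in NumPy, where the name manual_l2 is an assumption made for this example:

```python
import numpy as np

# Same sample data as defined in Step 2.
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L2 norm of each row: square root of the sum of squares.
row_l2_norm = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))

# Divide every row by its own L2 norm, giving unit-length rows.
manual_l2 = input_data / row_l2_norm

print(manual_l2)
print("Row sums of squares:", (manual_l2 ** 2).sum(axis=1))
```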
We already know that machine learning algorithms need data in a certain format. Another important requirement is that the data must be labelled properly before it is sent as input to machine learning algorithms. For example, in classification the data carries many labels, which may be words, numbers, and so on. The machine learning functions in sklearn expect the labels to be numbers. Hence, if the labels are in another form, they must be converted to numbers. This process of transforming word labels into numerical form is called label encoding.
Follow these steps for encoding the data labels in Python −
Step 1 − Importing the useful packages
As before, the first step is to import the packages needed for preprocessing. It can be done as follows −
-
- import numpy as np
- from sklearn import preprocessing
Step 2 − Defining sample labels
After importing the packages, we need to define some sample labels so that we can create and train the label encoder. We will now define the following sample labels −
-
- # Sample input labels
- input_labels = ['red','black','red','green','black','yellow','white']
Step 3 − Creating and training the label encoder object
In this step, we need to create the label encoder and train it. The following Python code will help in doing this −
-
- # Creating the label encoder
- encoder = preprocessing.LabelEncoder()
- encoder.fit(input_labels)
In an interactive session, the call to fit() returns the fitted encoder, so the following is displayed −
-
- LabelEncoder()
Step 4 − Checking the encoder by encoding a randomly ordered list
This step checks the trained encoder by encoding a list of labels in arbitrary order. The following Python code does this −
-
- # encoding a set of labels
- test_labels = ['green','red','black']
- encoded_values = encoder.transform(test_labels)
- print("\nLabels =", test_labels)
The labels would get printed as follows −
-
- Labels = ['green', 'red', 'black']
Now, we can print the encoded values, i.e., the word labels converted to numbers, as follows −
-
- print("Encoded values =", encoded_values.tolist())
The encoded values would get printed as follows −
-
- Encoded values = [1, 2, 0]
Step 5 − Checking the encoder by decoding a set of numbers
This step checks the trained encoder in the other direction, by decoding a set of numbers back into labels. The following Python code does this −
-
- # decoding a set of values
- encoded_values = [3,0,4,1]
- decoded_list = encoder.inverse_transform(encoded_values)
- print("\nEncoded values =", encoded_values)
The encoded values would get printed as follows −
-
- Encoded values = [3, 0, 4, 1]
Next, we print the decoded labels −
-
- print("\nDecoded labels =", decoded_list.tolist())
The decoded labels would get printed as follows −
-
- Decoded labels = ['white', 'black', 'yellow', 'green']
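The encoded and decoded values above follow from how the encoder numbers its classes: the unique labels are sorted, and each label's position in that sorted list is its code. The sketch below mimics this mapping in plain Python. It is an illustration of the idea rather than the library's implementation, and the names classes and label_to_code are assumptions made for this example.

```python
# Same sample labels as defined in Step 2.
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

# Sorted unique labels; the index of a label in this list is its integer code.
classes = sorted(set(input_labels))
label_to_code = {label: code for code, label in enumerate(classes)}

# Encoding: look up each label's code.
encoded = [label_to_code[label] for label in ['green', 'red', 'black']]

# Decoding: use each code as an index into the sorted class list.
decoded = [classes[code] for code in [3, 0, 4, 1]]

print("Classes =", classes)
print("Encoded values =", encoded)
print("Decoded labels =", decoded)
```

Because the codes depend only on the sorted order of the labels, 'black' is always 0 and 'yellow' is always 4 for this label set, regardless of the order in which the labels were supplied.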
Unlabeled data mainly consists of samples of natural or human-created objects that can easily be obtained from the world. They include audio, video, photos, news articles, etc.
Labeled data, on the other hand, takes a set of unlabeled data and augments each piece of it with a meaningful tag, label, or class. For example, a photo can be labeled according to its content, i.e., whether it is a photo of a boy, a girl, an animal, or anything else. Labeling requires human expertise or judgment about a given piece of unlabeled data.
There are many scenarios where unlabeled data is plentiful and easily obtained but labeled data often requires a human/expert to annotate. Semi-supervised learning attempts to combine labeled and unlabeled data to build better models.
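One simple way to combine labeled and unlabeled data is self-training: fit a model on the labeled samples, use it to assign pseudo-labels to the unlabeled samples, and then refit on both. The sketch below illustrates a single round of this with a nearest-centroid classifier in NumPy. The toy data, the choice of classifier, and all function names are assumptions made for this example, and self-training is only one of several semi-supervised approaches.

```python
import numpy as np

# Hypothetical toy data: two labeled points per class and four unlabeled points.
labeled_x = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_x = np.array([[1.0, 0.5], [4.0, 5.5], [0.5, 1.5], [6.0, 5.0]])


def class_centroids(x, y):
    """Return one centroid per class, ordered by class label."""
    return np.array([x[y == label].mean(axis=0) for label in np.unique(y)])


def nearest_centroid(x, centroids):
    """Assign each row of x to the index of its closest centroid."""
    distances = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


# 1. Fit on the labeled data only.
centroids = class_centroids(labeled_x, labeled_y)

# 2. Pseudo-label the unlabeled data with the current model.
pseudo_y = nearest_centroid(unlabeled_x, centroids)

# 3. Refit on the labeled and pseudo-labeled data together.
all_x = np.vstack([labeled_x, unlabeled_x])
all_y = np.concatenate([labeled_y, pseudo_y])
centroids = class_centroids(all_x, all_y)

print("Pseudo-labels =", pseudo_y.tolist())
print("Updated centroids =\n", centroids)
```

In practice, only pseudo-labels the model is confident about are usually kept, and the procedure is repeated for several rounds; both refinements are omitted here to keep the sketch short.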