Extracting, transforming and selecting features
This section covers algorithms for working with features, roughly divided into these groups:
• Extraction: Extracting features from “raw” data
• Transformation: Scaling, converting, or modifying features
• Selection: Selecting a subset from a larger set of features
• Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
Table of Contents
• Feature Extractors
o TF-IDF
o Word2Vec
o CountVectorizer
o FeatureHasher
• Feature Transformers
o Tokenizer
o StopWordsRemover
o n-gram
o Binarizer
o PCA
o PolynomialExpansion
o Discrete Cosine Transform (DCT)
o StringIndexer
o IndexToString
o OneHotEncoder
o VectorIndexer
o Interaction
o Normalizer
o StandardScaler
o RobustScaler
o MinMaxScaler
o MaxAbsScaler
o Bucketizer
o ElementwiseProduct
o SQLTransformer
o VectorAssembler
o VectorSizeHint
o QuantileDiscretizer
o Imputer
• Feature Selectors
o VectorSlicer
o RFormula
o ChiSqSelector
o UnivariateFeatureSelector
o VarianceThresholdSelector
• Locality Sensitive Hashing
o LSH Operations
Feature Transformation
Approximate Similarity Join
Approximate Nearest Neighbor Search
o LSH Algorithms
Bucketed Random Projection for Euclidean Distance
MinHash for Jaccard Distance
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d, while document frequency DF(t,D) is the number of documents that contains term t. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:
$$\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1}$$
where |D| is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D)$$
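As a quick worked example with made-up numbers (not tied to the code example below): in a corpus of 4 documents, a term that appears in 2 of them and occurs 3 times in document d gets, under the smoothed definition above,

$$\mathrm{IDF}(t, D) = \log \frac{4 + 1}{2 + 1} \approx 0.51, \qquad \mathrm{TFIDF}(t, d, D) = 3 \times 0.51 \approx 1.53,$$

while a term that appears in all 4 documents gets $\log \frac{5}{5} = 0$ and is effectively ignored.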
There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is $2^{18} = 262,144$. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
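A minimal sketch of how these two knobs are set (the input column name "words" is illustrative here; the full pipeline example appears below):

import org.apache.spark.ml.feature.HashingTF

// Illustrative only: 2^20 buckets to reduce hash collisions, and binary
// term counts (every nonzero count becomes 1.0) for models that expect 0/1 features.
val binaryHashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("binaryFeatures")
  .setNumFeatures(1 << 20) // a power of two, as recommended above
  .setBinary(true)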
CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.
IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.
Note: spark.ml doesn’t provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.
Examples
In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val sentenceData = spark.createDataFrame(Seq(
  (0.0, "Hi I heard about Spark"),
  (0.0, "I wish Java could use case classes"),
  (1.0, "Logistic regression models are neat")
)).toDF("label", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)

val featurizedData = hashingTF.transform(wordsData)
// alternatively, CountVectorizer can also be used to get term frequency vectors

val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)

val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show()
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/TfIdfExample.scala” in the Spark repo.
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.
Examples
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
Refer to the Word2Vec Scala docs for more details on the API.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n") }
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala” in the Spark repo.
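Beyond averaging word vectors per document, the fitted Word2VecModel also exposes the learned per-word vectors; a small optional sketch reusing the model variable from the example above:

// Each vocabulary word and its learned 3-dimensional embedding.
model.getVectors.show(false)

// The 2 words closest to "Spark" by cosine similarity.
model.findSynonyms("Spark", 2).show(false)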
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary, and generates a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector. If set to true all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
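A short sketch of the binary toggle mentioned above (the column names mirror the example that follows):

import org.apache.spark.ml.feature.CountVectorizer

// With setBinary(true), every nonzero token count is emitted as 1.0
// instead of the raw count.
val binaryCV = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setBinary(true)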
Examples
Assume that we have the following DataFrame with columns id and texts:
id | texts |
---|---|
0 | Array("a", "b", "c") |
1 | Array("a", "b", "b", "c", "a") |
each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). Then the output column "vector" after transformation contains:
id | texts | vector |
---|---|---|
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0]) |
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0]) |
Each vector represents the token counts of the document over the vocabulary.
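Here (3,[0,1,2],[1.0,1.0,1.0]) is Spark's sparse vector notation: the vector size followed by the indices and values of the nonzero entries. For reference only (not part of the example), the same vector built by hand would be:

import org.apache.spark.ml.linalg.Vectors

// A sparse vector of size 3 with the value 1.0 at indices 0, 1 and 2.
val v = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 1.0, 1.0))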
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/CountVectorizerExample.scala” in the Spark repo.
FeatureHasher
Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.
The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
• Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns using the categoricalCols parameter.
• String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
• Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.
Null (missing) values are ignored (implicitly zero in the resulting feature vector).
The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.
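As a hedged sketch of the knobs discussed above (the column names come from the example that follows; treating real as categorical is purely illustrative):

import org.apache.spark.ml.feature.FeatureHasher

// Hash the numeric column "real" as a categorical feature, i.e. hash the
// string "real=2.2" instead of placing the value 2.2 at the hash of "real",
// and use a power-of-two number of buckets as recommended above.
val categoricalHasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setCategoricalCols(Array("real"))
  .setNumFeatures(1 << 18)
  .setOutputCol("features")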
Examples
Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different data types as input will illustrate the behavior of the transform to produce a column of feature vectors.
real | bool | stringNum | string |
---|---|---|---|
2.2 | true | 1 | foo |
3.3 | false | 2 | bar |
4.4 | false | 3 | baz |
5.5 | false | 4 | foo |
Then the output of FeatureHasher.transform on this DataFrame is:
real | bool | stringNum | string | features |
---|---|---|---|---|
2.2 | true | 1 | foo | (262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0]) |
3.3 | false | 2 | bar | (262144,[6031, 80619,140467,174475],[1.0,1.0,1.0,3.3]) |
4.4 | false | 3 | baz | (262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0]) |
5.5 | false | 4 | foo | (262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5]) |
The resulting feature vectors could then be passed to a learning algorithm.
Refer to the FeatureHasher Scala docs for more details on the API.
import org.apache.spark.ml.feature.FeatureHasher

val dataset = spark.createDataFrame(Seq(
  (2.2, true, "1", "foo"),
  (3.3, false, "2", "bar"),
  (4.4, false, "3", "baz"),
  (5.5, false, "4", "foo")
)).toDF("real", "bool", "stringNum", "string")

val hasher = new FeatureHasher()
  .setInputCols("real", "bool", "stringNum", "string")
  .setOutputCol("features")

val featurized = hasher.transform(dataset)
featurized.show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala” in the Spark repo.
Feature Transformers
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter “pattern” (regex, default: “\s+”) is used as delimiters to split the input text. Alternatively, users can set parameter “gaps” to false indicating the regex “pattern” denotes “tokens” rather than splitting gaps, and find all matching occurrences as the tokenization result.
Examples
Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.
import org.apache.spark.ml.feature.{RegexTokenizer, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
)).toDF("id", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val regexTokenizer = new RegexTokenizer()
  .setInputCol("sentence")
  .setOutputCol("words")
  .setPattern("\\W") // alternatively .setPattern("\\w+").setGaps(false)

val countTokens = udf { (words: Seq[String]) => words.length }

val tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words")
  .withColumn("tokens", countTokens(col("words"))).show(false)
Find full example code at “examples/src/main/scala/org/apache/spark/examples/ml/TokenizerExample.scala” in the Spark repo.
StopWordsRemover
Stop words are words which should be excluded from the input, typically because the words appear frequently and don’t carry as much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stopwords is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are “danish”, “dutch”, “english”, “finnish”, “french”, “german”, “hungarian”, “italian”, “norwegian”, “portuguese”, “russian”, “spanish”, “swedish” and “turkish”. A boolean parameter caseSensitive indicates if the matches should be case sensitive (false by default).
Examples
Assume that we have the following DataFrame with columns id and raw:
id | raw |
---|---|
0 | [I, saw, the, red, balloon] |
1 | [Mary, had, a, little, lamb] |
Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

id | raw | filtered |
---|---|---|
0 | [I, saw, the, red, balloon] | [saw, red, balloon] |
1 | [Mary, had, a, little, lamb] | [Mary, little, lamb] |
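The accompanying example code is cut off in this copy; a minimal Scala sketch consistent with the DataFrame above (the construction of the input data is an assumption, not taken from this page) would be:

import org.apache.spark.ml.feature.StopWordsRemover

// Assumed construction of the input DataFrame shown above.
val dataSet = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon")),
  (1, Seq("Mary", "had", "a", "little", "lamb"))
)).toDF("id", "raw")

// Remove the default English stop words from the "raw" column.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")

remover.transform(dataSet).show(false)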