TF-IDF stands for term frequency-inverse document frequency. It weights a word by how often it appears in a document, discounted by how common the word is across the corpus: the IDF part is inversely related to a word's commonness, so the more documents a word appears in, the smaller the weight it tends to receive after encoding. This suppresses frequent but uninformative words. In sklearn, this encoding is performed by the TfidfVectorizer class in feature_extraction.text.
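For reference, with sklearn's default smooth_idf=True setting, the IDF factor is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the total number of documents and df(t) is the number of documents containing term t; the TF-IDF value is tf(t, d) × idf(t), and each document vector is then L2-normalized.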
TfidfVectorizer combines CountVectorizer and TfidfTransformer into a single step, producing TF-IDF values directly.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
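To inspect the result (get_feature_names_out() is the current sklearn API; in versions before 1.0 it was get_feature_names()):

print(tfidf_vec.get_feature_names_out())  # vocabulary, one entry per column
print(tfidf_matrix.toarray())             # dense view of the TF-IDF weights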
sample = ["Machine learning is fascinating, it is wonderful"
,"Machine learning is a sensational techonology"
,"Elsa is a popular character"]
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

vec = CountVectorizer()
X = vec.fit_transform(sample)
# Use get_feature_names_out() to label each column (get_feature_names() was removed in sklearn 1.2)
# Note: a sparse matrix cannot be passed to pandas directly, hence toarray()
CVresult = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
CVresult
from sklearn.feature_extraction.text import TfidfVectorizer as TFIDF

vec = TFIDF()
X = vec.fit_transform(sample)
# Again, get_feature_names_out() provides the column names
TFIDFresult = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
TFIDFresult
# After TF-IDF encoding, were the weights of frequently occurring words reduced?
CVresult.sum(axis=0)/CVresult.sum(axis=0).sum()
TFIDFresult.sum(axis=0) / TFIDFresult.sum(axis=0).sum()
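To see the shift directly, the two proportion vectors can be compared side by side; frequent but uninformative tokens such as "is" should take a smaller share under TF-IDF. A small sketch using the objects built above:

compare = pd.DataFrame({
    'count_share': CVresult.sum(axis=0) / CVresult.sum(axis=0).sum(),
    'tfidf_share': TFIDFresult.sum(axis=0) / TFIDFresult.sum(axis=0).sum(),
})
print(compare.sort_values('count_share', ascending=False))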
So what is mutual information? The mutual information between variables x and y measures how much the uncertainty of y is reduced once x is known; symmetrically, it also measures how much the uncertainty of x is reduced once y is known.
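In symbols, the standard definition is I(x; y) = H(y) − H(y|x) = H(x) − H(x|y) = Σ_x Σ_y p(x, y) log[ p(x, y) / (p(x) p(y)) ], which equals zero exactly when x and y are independent.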
Mutual information is built on entropy. What is entropy? The entropy of a random variable measures its uncertainty. For a variable y it is computed as H(y) = −Σ_y p(y) log p(y).
When y is discrete, the sum runs over its possible values; when y is continuous, the sum becomes the integral H(y) = −∫ p(y) log p(y) dy. In fact, entropy (with a base-2 logarithm) can be interpreted as the average number of binary bits needed to represent y.
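As a minimal sketch of the discrete case (illustrative, not from the original code), entropy can be computed from an empirical distribution:

import numpy as np

def entropy(labels):
    # Empirical probability of each observed value
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Base-2 logarithm, so the result is in bits
    return -(p * np.log2(p)).sum()

print(entropy([0, 0, 1, 1]))  # 1.0: a fair coin carries one bit of uncertainty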
## Feature selection with mutual information
import numpy as np
from sklearn.feature_selection import mutual_info_classif

mutual_values = mutual_info_classif(X_train, y_train)
print('median mutual information:', np.median(mutual_values))
print('number of features with zero mutual information:', sum(mutual_values == 0))
## Collect the indices of features whose mutual information is greater than 0
idx = [i for i, value in enumerate(mutual_values) if value > 0]
## Keep only those feature columns
X_train = X_train[:, idx]
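Assuming a matching held-out matrix X_test produced by the same vectorizer (it is not shown in this snippet), the same columns must be kept there as well, otherwise the train and test feature spaces diverge:

## X_test is assumed to come from the same vectorizer's transform()
X_test = X_test[:, idx]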
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import linear_model
with open(r'concat_10_data_plus_vam.pkl', 'rb') as f:
    data = pickle.load(f)   # the texts were pickled first
    label = pickle.load(f)  # then the labels
x_train, x_test, y_train, y_test = train_test_split(data, label, random_state=42, test_size=0.5)
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=50000)
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
model = linear_model.LogisticRegression(n_jobs=-1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
The result is 0.9637368731087971.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import linear_model
with open(r'concat_10_data_plus_vam.pkl', 'rb') as f:
    data = pickle.load(f)   # the texts were pickled first
    label = pickle.load(f)  # then the labels
x_train, x_test, y_train, y_test = train_test_split(data, label, random_state=42, test_size=0.5)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50000)
x_train = vectorizer.fit_transform(x_train)
x_test = vectorizer.transform(x_test)
model = linear_model.LogisticRegression(n_jobs=-1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
The result is 0.9632015319583317, nearly identical to the count-based model above.