2. Count Vectors (Bag of Words model)
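N-gram is similar to Count Vectors, except that runs of adjacent words are also treated as new tokens and counted.
These text representations share two drawbacks: the resulting vectors are very high-dimensional, so training takes a long time; and they ignore the relationships between words, recording only counts.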
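To make the counting concrete, here is a minimal pure-Python sketch of n-gram counting. The function name `ngram_counts` and the sample sentence are made up for illustration; this is not scikit-learn's implementation, only the same idea in a few lines.

```python
from collections import Counter


def ngram_counts(tokens, n_min=1, n_max=2):
    """Count every run of n_min..n_max adjacent tokens as one feature."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens)

print(counts["the"])      # unigram that occurs twice
print(counts["the cat"])  # bigram built from two adjacent words
print(len(counts))        # number of distinct features (vector dimension)
```

Even for this six-word sentence, adding bigrams doubles the number of distinct features from 5 to 10, which hints at why the dimensionality grows so quickly on a real corpus.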
from sklearn.feature_extraction.text import CountVectorizer
- #CountVectors+RidgeClassifier
- import pandas as pd
- from sklearn.feature_extraction.text import CountVectorizer
- from sklearn.linear_model import RidgeClassifier
- from sklearn.metrics import f1_score
- from sklearn.model_selection import train_test_split
-
-
-
- df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t',nrows = 15000)
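- # Count how often each token occurs and represent each text as a bag-of-words count vector (the feature set)
- # max_features=3000 keeps only the 3000 most frequent terms in the corpus
- # ngram_range=(1, 3) counts unigrams, bigrams, and trigrams
- vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
- train_text = vectorizer.fit_transform(df['text'])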
-
- X_train,X_val,y_train,y_val = train_test_split(train_text,df.label,test_size = 0.3)
-
-
- # Fit a ridge classifier on the training split (text features and labels)
- clf = RidgeClassifier()
- clf.fit(X_train, y_train)
- val_pred = clf.predict(X_val)  # predict on the held-out validation split
- print(f1_score(y_val,val_pred,average = 'macro'))
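A TF-IDF score has two parts: the term frequency (TF) and the inverse document frequency (IDF). The inverse document frequency is the logarithm of the total number of documents in the corpus divided by the number of documents that contain the term.
Given TF and IDF, multiplying the two values gives a term's TF-IDF score. In general, the larger a term's TF-IDF within an article, the more important that term is to the article. So by computing the TF-IDF of every term in an article and sorting in descending order, the top few terms can be taken as the article's keywords.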
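The definition above can be sketched in plain Python. This is a minimal illustration of the textbook formula (TF as relative frequency, IDF as log(N / df)); the toy corpus and the function name `tf_idf` are assumptions for the example, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made up for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    # Sort terms by descending score; the top ones act as keywords.
    top = sorted(doc_scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
    print(i, top)
```

In the first document, "the" occurs twice but also appears in another document, so the rarer "cat" ends up with the higher score.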
from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.feature_extraction.text import TfidfVectorizer
- import numpy as np
-
- corpus = [
- 'This is the first document.',
- 'This document is the second document.',
- 'And this is the third one.',
- 'Is this the first document?',
- ]
-
- vectorizer = TfidfVectorizer()
- X = vectorizer.fit_transform(corpus)  # TF-IDF matrix in sparse representation
-
- vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0
-
- X.toarray()
- # toarray() returns the dense matrix: one row per document, one TF-IDF value per vocabulary term
-
- # Print the top-scoring TF-IDF term of each document
- word = vectorizer.get_feature_names_out()
- # ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
-
- weight = X.toarray()
-
- for i in range(len(weight)):
-     w_sort = np.argsort(-weight[i])  # term indices sorted by descending TF-IDF
-
-     print('doc: {0}, top tf-idf is : {1},{2}'.format(corpus[i], word[w_sort[0]], weight[i][w_sort[0]]))
-
- from sklearn.feature_extraction.text import TfidfVectorizer
- document = ["I have a pen.",
- "I have an apple."]
- tfidf_model = TfidfVectorizer().fit(document)
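- sparse_result = tfidf_model.transform(document)  # TF-IDF matrix in sparse representation
- word = tfidf_model.get_feature_names_out()  # requires scikit-learn >= 1.0
- word
- # ['an', 'apple', 'have', 'pen']
-
- print(sparse_result)  # e.g. document 0, vocabulary index 3 ('pen'): TF-IDF 0.8148
-
- # Each line is (document index, vocabulary index) followed by the TF-IDF value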
- # '''
- # (0, 3) 0.8148024746671689
- # (0, 2) 0.5797386715376657
- # (1, 2) 0.4494364165239821
- # (1, 1) 0.6316672017376245
- # (1, 0) 0.6316672017376245
- # '''
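The values printed above differ from the plain tf × log(N / df) formula because `TfidfVectorizer`, with its default settings, smooths the IDF as ln((1 + N) / (1 + df)) + 1 and then L2-normalizes each document vector. The sketch below reproduces the two values for document 0 by hand; the document frequencies are read off the two-sentence corpus above, and the helper name `smooth_idf` is made up for illustration.

```python
import math

n_docs = 2
# Document frequencies in ["I have a pen.", "I have an apple."].
# The default tokenizer drops the one-character tokens "I" and "a".
df = {"an": 1, "apple": 1, "have": 2, "pen": 1}


def smooth_idf(term):
    """Smoothed IDF used by TfidfVectorizer's default settings."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


# Document 0 keeps the tokens "have" and "pen", each with raw count 1.
raw = {"have": 1 * smooth_idf("have"), "pen": 1 * smooth_idf("pen")}

# L2-normalize the document vector.
norm = math.sqrt(sum(v * v for v in raw.values()))
doc0 = {term: value / norm for term, value in raw.items()}

print(doc0)
```

The results agree with the (0, 3) and (0, 2) entries of the sparse output above to within floating-point error.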
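Before TF-IDF can be applied to Chinese text, the text has to be segmented into words; here the jieba tool does the segmentation.
It can be installed directly from a Jupyter notebook code cell: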
pip install jieba
-
- import jieba
- text = """我是一条天狗呀!
- 我把月来吞了,
- 我把日来吞了,
- 我把一切的星球来吞了,
- 我把全宇宙来吞了。
- 我便是我了!"""
-
- sentences = text.split()
- sent_words = [list(jieba.cut(sen0)) for sen0 in sentences ]
- document= [' '.join(sen0) for sen0 in sent_words]
- print(document)
- # ['我 是 一条 天狗 呀 !', '我 把 月 来 吞 了 ,', '我 把 日来 吞 了 ,', '我 把 一切 的 星球 来 吞 了 ,', '我 把 全宇宙 来 吞 了 。', '我 便是 我 了 !']
-
- model = TfidfVectorizer().fit(document)
- print(model.vocabulary_)
- # {'一条': 1, '天狗': 4, '日来': 5, '一切': 0, '星球': 6, '全宇宙': 3, '便是': 2}
-
- sparse_result = model.transform(document)
- print(sparse_result)
- '''
- (0, 4) 0.7071067811865476
- (0, 1) 0.7071067811865476
- (2, 5) 1.0
- (3, 6) 0.7071067811865476
- (3, 0) 0.7071067811865476
- (4, 3) 1.0
- (5, 2) 1.0'''
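The vocabulary printed above contains only multi-character words, and sentence 1 has no entries at all. That follows from the default `token_pattern` of scikit-learn's vectorizers, which keeps only tokens of two or more word characters. The sketch below applies that pattern with the standard `re` module to the first segmented sentence; the second pattern is an illustrative alternative, not something the original post uses.

```python
import re

# Default token pattern of CountVectorizer / TfidfVectorizer:
# only tokens with two or more word characters are kept.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern (an assumption for illustration) that also keeps
# single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

sentence = "我 是 一条 天狗 呀 !"  # first segmented sentence from the example above

default_tokens = re.findall(DEFAULT_PATTERN, sentence)
all_tokens = re.findall(SINGLE_CHAR_PATTERN, sentence)

print(default_tokens)  # ['一条', '天狗']
print(all_tokens)      # ['我', '是', '一条', '天狗', '呀']
```

Passing `token_pattern=r"(?u)\b\w+\b"` to `TfidfVectorizer` would keep the single-character words as well; whether that helps depends on the task.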
- #TF-IDF + RidgeClassifier
- import pandas as pd
-
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.linear_model import RidgeClassifier
- from sklearn.metrics import f1_score
- from sklearn.model_selection import train_test_split
-
-
- df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t',nrows = 15000)
-
- # TF-IDF features over unigrams to trigrams, keeping the 3000 most frequent terms
- train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)
-
- X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)
-
-
- clf = RidgeClassifier()
- clf.fit(X_train, y_train)
- val_pred = clf.predict(X_val)  # predict on the held-out validation split
- print(f1_score(y_val,val_pred,average = 'macro'))