Count Vector: the frequency of each word in a sentence, as a vectorized representation; sentence → numeric vector.
Pros: simple and easy to understand.
Cons: when the corpus is large, the resulting vectors form a sparse matrix, which complicates downstream computation.
Optimization: build the vocabulary from only the most frequently occurring words.
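The frequency-capped vocabulary idea can be sketched as follows (a minimal stdlib-only sketch; the toy English corpus and the top-k cutoff are illustrative assumptions, not part of the original example):

```python
from collections import Counter

# Toy corpus; whitespace tokenization stands in for a real segmenter here.
docs = [
    "the boot is big the boot fits",
    "the shoe is small the shoe fits well",
]
tokens = [doc.split() for doc in docs]

# Corpus-wide word frequencies.
freq = Counter(w for doc in tokens for w in doc)

# Keep only the k most frequent words as the vocabulary.
k = 4
vocab = [w for w, _ in freq.most_common(k)]
word_index = {w: i for i, w in enumerate(vocab)}

# Vectorize: words outside the capped vocabulary are simply dropped.
vectors = []
for doc in tokens:
    vec = [0] * len(vocab)
    for w in doc:
        if w in word_index:
            vec[word_index[w]] += 1
    vectors.append(vec)
print(vocab, vectors)
```

Dropping rare words keeps the vectors short and dense at the cost of losing the rarest tokens.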
Example:
Sentence 1: 这只皮靴号码大了,那只号码合适 ("this boot runs large; that one's size fits")
Sentence 2: 这只皮靴号码不小,那只更合适 ("this boot isn't small; that one fits better")
```python
import jieba

# Segment the sentences with jieba
textA = '这只皮靴号码大了,那只号码合适'
textB = '这只皮靴号码不小,那只更合适'
bowA = list(jieba.cut(textA))
print('/'.join(bowA))
bowA.remove(',')          # drop the punctuation token
bowB = list(jieba.cut(textB))
print('/'.join(bowB))
bowB.remove(',')
list_ = [bowA, bowB]

# Build the vocabulary: the union of the two token sets
word_set = set(bowA).union(set(bowB))
print(word_set)

# Map each vocabulary word to a column index
word_index_dict = {}
for index, word in enumerate(word_set):
    word_index_dict[word] = index

# Compute the count vectors
count_vector = []
for text in list_:
    vector_list = [0] * len(word_set)
    for word in text:
        vector_list[word_index_dict[word]] += 1
    count_vector.append(vector_list)
print('count vector:', count_vector)
```
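The counting loop above can also be written with `collections.Counter`; a compact stdlib-only restatement (the English tokens are a hypothetical stand-in for jieba's output, so the sketch runs without jieba):

```python
from collections import Counter

# Hypothetical tokenized documents (standing in for jieba's output).
tokenized = [
    ["this", "boot", "size", "big", "that", "size", "fits"],
    ["this", "boot", "size", "not-small", "that", "fits", "better"],
]

# Vocabulary: the union of all tokens, sorted for a stable column order.
vocab = sorted({w for doc in tokenized for w in doc})

# One Counter per document yields its count vector directly.
count_vector = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
print(vocab)
print(count_vector)
```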
TF-IDF (term frequency–inverse document frequency): if a word appears frequently in one document but rarely across the corpus, that word is highly important to the document; TF-IDF is used to extract a document's keywords.
Pros: simple, fast, and easy to understand.
Cons: it measures importance by frequency alone and ignores where in the document a word appears, so it is not comprehensive.
TF (term frequency) can be computed as:
(1) occurrences of the word in the document / total number of words in the document (most common)
(2) occurrences of the word in the document / occurrences of the document's most frequent word
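The two TF definitions can be checked on a small worked example (the token list below is an assumed segmentation of sentence 1 with punctuation removed, shown purely for illustration):

```python
from collections import Counter

# Assumed segmentation of sentence 1 (punctuation already removed).
tokens = ['这', '只', '皮靴', '号码', '大', '了', '那', '只', '号码', '合适']
counts = Counter(tokens)

word = '号码'
# Definition (1): occurrences / total words in the document
tf1 = counts[word] / len(tokens)
# Definition (2): occurrences / count of the most frequent word
tf2 = counts[word] / max(counts.values())
print(tf1, tf2)
```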
IDF (inverse document frequency) is computed as:
IDF = log( total number of documents in the corpus / (number of documents containing the word + 1) )
TF-IDF:
TF-IDF = TF × IDF
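Plugging small numbers into the two formulas above (the document frequency and word counts here are hypothetical):

```python
import math

n = 2            # total documents in the corpus
df = 2           # documents containing the word (hypothetical)
tf = 2 / 10      # the word appears 2 times in a 10-word document

# IDF = log10(n / (df + 1)), then TF-IDF = TF * IDF
idf = math.log10(n / (df + 1))
tfidf = tf * idf
print(idf, tfidf)
```

Note that on such a tiny corpus the IDF (and hence the TF-IDF) can come out negative, since log10(2/3) < 0; smoothed variants such as log((n+1)/(df+1)) + 1 are often used to avoid this.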
Example:
Sentence 1: 这只皮靴号码大了,那只号码合适
Sentence 2: 这只皮靴号码不小,那只更合适
```python
import math
import pandas as pd
import jieba

# Segment the sentences with jieba
textA = '这只皮靴号码大了,那只号码合适'
textB = '这只皮靴号码不小,那只更合适'
bowA = list(jieba.cut(textA))
bowA.remove(',')          # drop the punctuation token
bowB = list(jieba.cut(textB))
bowB.remove(',')

# Build the vocabulary
word_set = set(bowA).union(set(bowB))

# Per-document word counts
word_dictA = dict.fromkeys(word_set, 0)
word_dictB = dict.fromkeys(word_set, 0)
for word in bowA:
    word_dictA[word] += 1
for word in bowB:
    word_dictB[word] += 1
print('word counts:\n', pd.DataFrame([word_dictA, word_dictB]))

# Compute TF: occurrences / total words in the document
def computeTF(word_dict, bow):
    tf_dict = {}
    bow_count = len(bow)
    for key, value in word_dict.items():
        tf_dict[key] = value / float(bow_count)
    return tf_dict

tfbowa = computeTF(word_dictA, bowA)
tfbowb = computeTF(word_dictB, bowB)
print('TF:\n', pd.DataFrame([tfbowa, tfbowb]))

# Compute IDF: log10(total documents / (documents containing the word + 1))
def computeIDF(doc_list):
    # total number of documents in the corpus
    n = len(doc_list)

    # count the number of documents containing each word
    idf_dict = dict.fromkeys(doc_list[0].keys(), 0)
    for doc in doc_list:
        for word, val in doc.items():
            if val > 0:
                idf_dict[word] += 1

    # apply the IDF formula
    for word, val in idf_dict.items():
        idf_dict[word] = math.log10(n / (float(val) + 1))

    print('total documents:', n)
    return idf_dict

idf = computeIDF([word_dictA, word_dictB])
print('IDF:\n', pd.DataFrame([idf]))

# Compute TF-IDF = TF * IDF
def computeTF_IDF(tfbow, idf):
    tfidf = {}
    for word, val in tfbow.items():
        tfidf[word] = idf[word] * val
    return tfidf

tfidf_A = computeTF_IDF(tfbowa, idf)
tfidf_B = computeTF_IDF(tfbowb, idf)
print('TF-IDF:\n', pd.DataFrame([tfidf_A, tfidf_B]))
```