The CBOW model takes the context of a target word as input and predicts the target word itself; that is, given the context word vectors, it infers the center word. For example:
an efficient method for learning high quality distributed vector
In this sentence, with a context window of 4 on each side, "learning" is the center word, so the 8 surrounding word vectors are the model's input and the word vector for "learning" is the model's output.
CBOW uses a bag-of-words assumption: the distance between each of these 8 words and the center word is ignored, and every context word counts equally as long as it falls inside the window.
In this CBOW example, the input is the 8 context word vectors and the output is a softmax probability over the whole vocabulary; during training we want the softmax probability of the true center word to be the largest. The input layer therefore receives the 8 word vectors, the output layer has one neuron per word in the vocabulary, and the hidden layer size can be chosen freely; the resulting DNN is trained with back-propagation.
The Skip-gram model is designed the other way around from CBOW. In the same example, the input is "learning" and the outputs are the 8 surrounding words: the input is a specific word, and from the output softmax probabilities we take the top-8 words as the predicted context.
TF-IDF is computed as:
tfidf = tf * idf
where tf is the term frequency, i.e. how often a term occurs in an article. For example, if an article contains 100 words in total and a given term occurs 12 times, then tf = 12/100 = 0.12.
idf is the inverse document frequency, which measures how distinctive a term is: if the corpus contains n articles and the term appears in k of them, then
idf = log2(n/k).
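The arithmetic can be checked directly; the corpus size n = 10 and document count k = 5 below are made-up numbers for illustration:

```python
import math

tf = 12 / 100            # the term occurs 12 times in a 100-word article
idf = math.log2(10 / 5)  # n = 10 articles, k = 5 contain the term -> log2(2) = 1.0
print(tf * idf)          # 0.12
```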
text1 = """Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes."""

text2 = """Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated."""

text3 = """Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents' playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net."""

import math
import string

import nltk
from nltk.corpus import stopwords  # stop words

# Preprocessing: split text into sentences, then words, dropping punctuation.
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)            # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):   # word segmentation
            if word not in string.punctuation:  # drop punctuation
                tokens.append(word)
    return tokens

def get_words(text):
    tokens = get_tokens(text)
    # remove stop words
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# tokenized documents
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# compute TF-IDF by hand
def tf(word, text):
    return text.count(word) / len(text)

def idf(word, countlist):
    sum_ = sum(1 for text in countlist if word in text)
    return math.log2(len(countlist) / sum_)

def tfidf(word, text, countlist):
    return tf(word, text) * idf(word, countlist)

for i, doc in enumerate(countlist, 1):
    print(f"Top words in document {i}")
    scores = {word: tfidf(word, doc, countlist) for word in doc}
    sorted_scores = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_scores[:3]:
        print(f'{word}, TF-IDF: {round(score, 5)}')
import string

import nltk
from nltk.corpus import stopwords  # stop words
from gensim import corpora, models

# Preprocessing: split text into sentences, then words, dropping punctuation.
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)            # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):   # word segmentation
            if word not in string.punctuation:  # drop punctuation
                tokens.append(word)
    return tokens

print(get_tokens(text1))

# training with gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    # remove stop words
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# 1. Build the dictionary.
dictionary = corpora.Dictionary(countlist)
# Invert token2id to id -> word so we can look words up by id later.
new_dict = {v: k for k, v in dictionary.token2id.items()}
# 2. Build the corpus as (word id, count) pairs.
corpus2 = [dictionary.doc2bow(count) for count in countlist]
# 3. Fit the TfidfModel on the corpus.
tfidf2 = models.TfidfModel(corpus2)
# 4. Apply the model to corpus2 to get its TF-IDF weights.
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf, 1):
    print(f"Top words in document {i}")
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))
For example, lemmatizing "cars" yields "car", and lemmatizing "ate" yields "eat".
In Python's nltk module, WordNet provides a robust lemmatization function:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))
# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))
# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))
In the code above, the wnl.lemmatize() method performs the lemmatization: the first argument is the word and the second is its part of speech, such as 'n' for noun, 'v' for verb, and 'a' for adjective. The part of speech must be correct, otherwise the expected lemma is not returned.
In NLP, this is handled with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns the part of speech of each word in a sentence, as in the following Python code:
sentence = 'The brown fox is quick and he is jumping over the lazy dog'
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
Tags beginning with N are nouns, J adjectives, V verbs, and R adverbs.
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

sentence = 'The brown fox is quick and he is jumping over the lazy dog'

# Map a Penn Treebank tag prefix to the corresponding WordNet POS constant.
def get_wordtag(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    wordtag = get_wordtag(tag) or wordnet.NOUN  # default to noun
    lemmas_sent.append(wnl.lemmatize(word, wordtag))

print(tokens)
print(lemmas_sent)
['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
['The', 'brown', 'fox', 'be', 'quick', 'and', 'he', 'be', 'jump', 'over', 'the', 'lazy', 'dog']
As the output shows, we can now tokenize a sentence and lemmatize each token according to its part of speech, obtaining the list of lemmatized words.