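TF-IDF measures how important a word is to one document within a collection of documents (a corpus). It is commonly used to extract text features, i.e. keywords. A word's importance grows in proportion to the number of times it appears in the document, and falls as the word becomes more frequent across the corpus. In other words, the more often a word occurs in the document under comparison, the more important it is; but if the word also occurs in many documents of the corpus, it becomes less important, because a word that every document contains is not distinctive.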


tfidf = tf * idf

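Here tf is the term frequency: how often a word occurs in a single document. For example, if a document contains 100 words in total and a given word occurs 12 times, then tf = 12/100 = 0.12.
idf is the inverse document frequency: idf = log2(n/k), where n is the total number of documents in the corpus and k is the number of documents that contain the word.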
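To make the arithmetic concrete, here is a minimal sketch of the two formulas. The helper names term_frequency and inverse_document_frequency, and the corpus of 8 documents of which 2 contain the word, are assumptions made up for illustration; only the 12-in-100 example comes from the text above.

```python
import math

def term_frequency(count, total_words):
    """tf: occurrences of the word divided by the document length."""
    return count / total_words

def inverse_document_frequency(n_docs, docs_with_word):
    """idf = log2(n / k), with n documents in total and k containing the word."""
    return math.log2(n_docs / docs_with_word)

tf_value = term_frequency(12, 100)             # 12 occurrences in 100 words
idf_value = inverse_document_frequency(8, 2)   # hypothetical corpus: n = 8, k = 2
score = tf_value * idf_value                   # tfidf = tf * idf

print(tf_value, idf_value, score)
```

With these numbers tf is 0.12 and idf is log2(4) = 2, so the score is 0.24. A word that appears in every document gets idf = log2(1) = 0, so its TF-IDF is 0 no matter how frequent it is.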

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""
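# Manual TF-IDF computation
import math
import string

import nltk
from nltk.corpus import stopwords     # stop words

# Requires NLTK's tokenizer and stop-word data (see nltk.download).

# Split the text into sentences, tokenize each one, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', ' ')
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if word not in string.punctuation:
                tokens.append(word)
    return tokens

# Build the stop-word set once instead of reloading the list for every token
stop_words = set(stopwords.words('english'))

# Tokenize a text and remove English stop words
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stop_words]
    return filtered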

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

# print(countlist)

def tf(word,text):
    return text.count(word)/len(text)

def idf(word,countlist):
    sum_ = sum(1 for text in countlist if word in text)
    return math.log2(len(countlist)/sum_)

def tfidf(word,text,countlist):
    return tf(word,text)*idf(word,countlist)

for i, doc in enumerate(countlist, 1):
    print(f"Top words in document{i}")
    scores = {word: tfidf(word,doc,countlist) for word in doc}
    sorted_scores = sorted(scores.items(),key=lambda x:x[1],reverse=True)
    for word,score in sorted_scores[:3]:
        print(f'{word},    TF-IDF:{round(score,5)}')
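The same functions can be sanity-checked on a corpus small enough to verify by hand. The sketch below repeats tf, idf and tfidf so that it runs on its own, without NLTK; the three token lists in toy_docs are invented for illustration and are not derived from the texts above.

```python
import math

def tf(word, text):
    return text.count(word) / len(text)

def idf(word, countlist):
    containing = sum(1 for text in countlist if word in text)
    return math.log2(len(countlist) / containing)

def tfidf(word, text, countlist):
    return tf(word, text) * idf(word, countlist)

# Hypothetical, already-tokenized documents
toy_docs = [
    ["ball", "goal", "ball"],
    ["ball", "hoop"],
    ["ball", "net", "net"],
]

# One {word: score} dictionary per document
toy_scores = [
    {word: tfidf(word, doc, toy_docs) for word in doc}
    for doc in toy_docs
]

for i, scores in enumerate(toy_scores, 1):
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    print(f"document{i}: {ranked}")
```

"ball" occurs in all three lists, so its idf is log2(3/3) = 0 and its score is 0 everywhere, even though it is the most frequent word in the first list. "net" scores (2/3) * log2(3), about 1.057, and ranks first in the third list.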
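# TF-IDF with gensim's TfidfModel
import string

import nltk
from nltk.corpus import stopwords     # stop words
from gensim import corpora, models

# Text preprocessing
# Split the text into sentences, tokenize each one, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', ' ')
    sents = nltk.sent_tokenize(text)  # sentence splitting
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # word tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens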


# Tokenize a text and remove English stop words
stop_words = set(stopwords.words('english'))  # build the set once

def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stop_words]
    return filtered

# get text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]

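# 1. Build the dictionary (word -> id)
dictionary = corpora.Dictionary(countlist)

# Invert it to id -> word so that words can be looked up by id at the end
new_dict = {v:k for k,v in dictionary.token2id.items()}

# 2. Build the bag-of-words corpus: (word id, count) pairs per document
corpus2 = [dictionary.doc2bow(count) for count in countlist]

# 3. Fit a TfidfModel on the corpus
tfidf2 = models.TfidfModel(corpus2)

# 4. Apply the model to corpus2 to get its TF-IDF weights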
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf, 1):
    print(f"Top words in document{i}")
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
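The scores printed by gensim will generally not equal those of the hand-written version, even though the ranking within a document should agree. As far as I can tell, TfidfModel with its default settings uses the raw term count as tf, log2(n/k) as idf, drops terms whose weight is zero, and then scales each document vector to unit length; treat that description as an assumption and check it against the gensim documentation for your version. The sketch below mimics that scheme in plain Python, on invented token lists, so the effect of the normalization can be seen without gensim installed. The function name unit_length_tfidf is hypothetical.

```python
import math
from collections import Counter

def unit_length_tfidf(docs):
    """Raw count * log2(n/k) per word, then L2-normalize each document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    doc_freq = Counter(word for doc in docs for word in set(doc))
    result = []
    for doc in docs:
        weights = {}
        for word, count in Counter(doc).items():
            weight = count * math.log2(n_docs / doc_freq[word])
            if weight > 0:  # words found in every document drop out
                weights[word] = weight
        norm = math.sqrt(sum(w * w for w in weights.values()))
        result.append({word: w / norm for word, w in weights.items()} if norm else {})
    return result

# Hypothetical, already-tokenized documents
toy_docs = [
    ["ball", "goal", "goal", "kick"],
    ["ball", "hoop", "court"],
    ["ball", "net", "court"],
]

normalized = unit_length_tfidf(toy_docs)
for i, weights in enumerate(normalized, 1):
    print(f"document{i}: {weights}")
```

In the first list "goal" ends up at 2/sqrt(5), about 0.894, and "kick" at 1/sqrt(5), about 0.447. Dividing by the vector length changes the magnitudes but not the order of the words within a document.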
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))
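In the code above, wnl.lemmatize() performs lemmatization. Its first argument is the word and its second is the part of speech: n for noun, v for verb, a for adjective, and so on. The part of speech must be correct; otherwise the expected base form is not returned.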


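In NLP, a word's part of speech is determined with part-of-speech (POS) tagging. In nltk, nltk.pos_tag() returns the part of speech of each word in a sentence, as in the following Python code: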

import nltk

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

Output:
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]


from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

sentence = 'The brown fox is quick and he is jumping over the lazy dog'

def get_wordtag(tag):
    if tag.startswith('J'):
        return  wordnet.ADJ
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

tokens = word_tokenize(sentence)
tagged_sent = pos_tag(tokens)

wnl = WordNetLemmatizer()
lemmas_sent = []
for word, tag in tagged_sent:
    # fall back to noun when the tag has no WordNet equivalent
    wordtag = get_wordtag(tag) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(word, pos=wordtag))

print(tokens)
print(lemmas_sent)

Output:
['The', 'brown', 'fox', 'is', 'quick', 'and', 'he', 'is', 'jumping', 'over', 'the', 'lazy', 'dog']
['The', 'brown', 'fox', 'be', 'quick', 'and', 'he', 'be', 'jump', 'over', 'the', 'lazy', 'dog']
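To see why words such as 'The', 'and', 'he' and 'over' pass through unchanged, it helps to look at which tags actually map to a WordNet part of speech. The sketch below is a table-driven variant of the mapping that needs no NLTK data: it hard-codes the tagged sentence shown earlier and uses the letters 'a', 'n', 'v' and 'r', which are the values behind wordnet.ADJ, wordnet.NOUN, wordnet.VERB and wordnet.ADV. The names PENN_PREFIX_TO_WORDNET and penn_to_wordnet are hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech letter
PENN_PREFIX_TO_WORDNET = {"J": "a", "N": "n", "V": "v", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)

# Tagged sentence copied from the pos_tag output shown earlier
tagged_sent = [
    ('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'),
    ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'),
    ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
    ('dog', 'NN'),
]

# Words whose tag has no entry in the table fall back to the default (noun)
fallback_words = [w for w, t in tagged_sent if t[:1] not in PENN_PREFIX_TO_WORDNET]
mapped = [(w, penn_to_wordnet(t)) for w, t in tagged_sent]

print(fallback_words)
print(mapped)
```

Determiners, conjunctions, pronouns and prepositions have no counterpart among the four WordNet categories, so they are lemmatized as nouns and normally come back as they went in.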



