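TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document within a document collection. It combines two quantities: term frequency (TF) and inverse document frequency (IDF). The following example uses NLTK to count how often each word occurs in a small set of user reviews.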
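- import nltk
- from nltk import FreqDist
- # word_tokenize needs NLTK's Punkt tokenizer data, e.g. nltk.download('punkt')
- # (newer NLTK releases use 'punkt_tab' instead)
- # User reviews collected by the recommender system
- reviews = [
-     "This movie is great!",
-     "I love this movie so much.",
-     "The acting in this film is superb.",
-     "The plot of this movie is confusing.",
-     "I didn't enjoy this film."
- ]
- # Merge all reviews into a single string
- text = ' '.join(reviews)
- # Tokenize (case-sensitive; punctuation marks become separate tokens)
- tokens = nltk.word_tokenize(text)
- # Count how often each token occurs
- freq_dist = FreqDist(tokens)
- # Print the frequency of each token
- for word, frequency in freq_dist.items():
-     print(f"Word: {word}, Frequency: {frequency}")
Running the code prints: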
- Word: This, Frequency: 1
- Word: movie, Frequency: 3
- Word: is, Frequency: 3
- Word: great, Frequency: 1
- Word: !, Frequency: 1
- Word: I, Frequency: 2
- Word: love, Frequency: 1
- Word: this, Frequency: 4
- Word: so, Frequency: 1
- Word: much, Frequency: 1
- Word: ., Frequency: 4
- Word: The, Frequency: 2
- Word: acting, Frequency: 1
- Word: in, Frequency: 1
- Word: film, Frequency: 2
- Word: superb, Frequency: 1
- Word: plot, Frequency: 1
- Word: of, Frequency: 1
- Word: confusing, Frequency: 1
- Word: did, Frequency: 1
- Word: n't, Frequency: 1
- Word: enjoy, Frequency: 1
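The counts above are raw frequencies over all reviews merged into one string, and "This" and "this" are counted as different tokens. When TF is used as a per-document weight, it is often normalized by the document's length instead. The following is a minimal sketch of that variant, not the method used above; the `term_frequency` helper and its regex tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter


def term_frequency(document):
    """Return {word: count / total_words} for one document (hypothetical helper)."""
    # Assumption: lowercase the text and treat runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", document.lower())
    counts = Counter(words)
    total = len(words)
    # Return an empty dict for an empty document to avoid dividing by zero.
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


reviews = [
    "This movie is great!",
    "I love this movie so much.",
]

for review in reviews:
    print(review, term_frequency(review))
```

With this normalization the weights of each non-empty document sum to 1, so long and short reviews become comparable.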
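Inverse document frequency (IDF) is a term-weighting scheme commonly used in recommender systems. It measures how informative a word is across a text collection: a word that appears in only a few documents gets a high IDF, while a word that appears in almost every document gets a low one. In recommender systems, IDF is usually combined with term frequency (TF) to form the TF-IDF feature representation. TF-IDF accounts for both how prominent a word is in the current text (through TF) and how common or distinctive it is across the whole collection (through IDF).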
- import math
- from collections import Counter
- # Document collection
- documents = [
-     "This is the first document.",
-     "This document is the second document.",
-     "And this is the third one.",
-     "Is this the first document?"
- ]
- # Split each document on whitespace and deduplicate its words
- word_sets = [set(document.lower().split()) for document in documents]
- # Compute the inverse document frequency of every word
- idf = {}
- num_documents = len(documents)
- for word in set(word for word_set in word_sets for word in word_set):
-     # Number of documents that contain this word
-     count = sum(1 for word_set in word_sets if word in word_set)
-     idf[word] = math.log(num_documents / (count + 1))
- # Print the inverse document frequencies
- for word, idf_value in idf.items():
-     print(f"Word: {word}, IDF: {idf_value}")
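The code above first splits each document into words and removes duplicates, giving one set of words per document. It then iterates over the union of these sets and computes the inverse document frequency of each word as log(N / (n + 1)), where log is the natural logarithm, N is the number of documents in the collection, and n is the number of documents that contain the word. The +1 in the denominator is a smoothing term; as a side effect, a word that appears in every document gets a negative IDF, because N / (N + 1) is less than 1. Because the text is split on whitespace only, punctuation stays attached to the words, so "document", "document." and "document?" are counted as three different words. Finally, the code prints each word with its IDF. The output looks like this (the order of the lines may differ between runs because sets are unordered):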
- Word: this, IDF: -0.2231435513142097
- Word: third, IDF: 0.6931471805599453
- Word: second, IDF: 0.6931471805599453
- Word: document?, IDF: 0.6931471805599453
- Word: first, IDF: 0.28768207245178085
- Word: is, IDF: -0.2231435513142097
- Word: one., IDF: 0.6931471805599453
- Word: document, IDF: 0.6931471805599453
- Word: and, IDF: 0.6931471805599453
- Word: document., IDF: 0.28768207245178085
- Word: the, IDF: -0.2231435513142097
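The negative values and the punctuation-split variants of "document" in the output above both follow from choices made in that example. The following is a minimal sketch of one common alternative, not a correction of the author's method: it assumes a simple regex tokenizer and the smoothed formula ln((1 + N) / (1 + n)) + 1, which scikit-learn documents as its default when `smooth_idf=True`. The function name `smoothed_idf` is hypothetical.

```python
import math
import re


def smoothed_idf(documents):
    """Return {word: ln((1 + N) / (1 + n)) + 1} (hypothetical helper)."""
    # Assumption: lowercase and keep only runs of word characters, so that
    # "document." and "document?" both become "document".
    word_sets = [set(re.findall(r"\w+", doc.lower())) for doc in documents]
    num_documents = len(documents)
    vocabulary = set().union(*word_sets)
    idf = {}
    for word in sorted(vocabulary):  # sorted only to get a stable print order
        doc_count = sum(1 for word_set in word_sets if word in word_set)
        idf[word] = math.log((1 + num_documents) / (1 + doc_count)) + 1
    return idf


documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

for word, value in smoothed_idf(documents).items():
    print(f"Word: {word}, IDF: {value}")
```

With this formula the smallest possible IDF is 1 (for a word that occurs in every document), so no weight is negative and common words are down-weighted rather than zeroed out.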
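TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used term-weighting scheme: the weight of a word is its term frequency multiplied by its inverse document frequency, and it measures how important the word is to a text. TF-IDF highlights words that occur often in the current text but are comparatively rare across the whole collection, so it captures features that are both distinctive and important. In recommender systems, TF-IDF is commonly used for text feature representation and similarity computation. The following example computes TF-IDF weights in Python with scikit-learn.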
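- from sklearn.feature_extraction.text import TfidfVectorizer
- # Document collection
- documents = [
-     "This is the first document.",
-     "This document is the second document.",
-     "And this is the third one.",
-     "Is this the first document?"
- ]
- # Create the TF-IDF vectorizer
- vectorizer = TfidfVectorizer()
- # Vectorize the document collection (returns a sparse matrix)
- tfidf_matrix = vectorizer.fit_transform(documents)
- # Print each word together with its TF-IDF weight
- # (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
- feature_names = vectorizer.get_feature_names_out()
- for i, doc in enumerate(documents):
-     # Column indices of the non-zero entries in row i
-     feature_index = tfidf_matrix[i, :].nonzero()[1]
-     tfidf_scores = zip(feature_index, [tfidf_matrix[i, x] for x in feature_index])
-     for word_index, score in tfidf_scores:
-         print(f"Document: {doc}, Word: {feature_names[word_index]}, TF-IDF Score: {score}")
Running the code prints output like the following (the order of the words within a document may differ between scikit-learn versions):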
- Document: This is the first document., Word: document, TF-IDF Score: 0.46979138557992045
- Document: This is the first document., Word: first, TF-IDF Score: 0.5802858236844359
- Document: This is the first document., Word: the, TF-IDF Score: 0.38408524091481483
- Document: This is the first document., Word: is, TF-IDF Score: 0.38408524091481483
- Document: This is the first document., Word: this, TF-IDF Score: 0.38408524091481483
- Document: This document is the second document., Word: second, TF-IDF Score: 0.5386476208856763
- Document: This document is the second document., Word: document, TF-IDF Score: 0.6876235979836938
- Document: This document is the second document., Word: the, TF-IDF Score: 0.281088674033753
- Document: This document is the second document., Word: is, TF-IDF Score: 0.281088674033753
- Document: This document is the second document., Word: this, TF-IDF Score: 0.281088674033753
- Document: And this is the third one., Word: one, TF-IDF Score: 0.511848512707169
- Document: And this is the third one., Word: third, TF-IDF Score: 0.511848512707169
- Document: And this is the third one., Word: and, TF-IDF Score: 0.511848512707169
- Document: And this is the third one., Word: the, TF-IDF Score: 0.267103787642168
- Document: And this is the third one., Word: is, TF-IDF Score: 0.267103787642168
- Document: And this is the third one., Word: this, TF-IDF Score: 0.267103787642168
- Document: Is this the first document?, Word: document, TF-IDF Score: 0.46979138557992045
- Document: Is this the first document?, Word: first, TF-IDF Score: 0.5802858236844359
- Document: Is this the first document?, Word: the, TF-IDF Score: 0.38408524091481483
- Document: Is this the first document?, Word: is, TF-IDF Score: 0.38408524091481483
- Document: Is this the first document?, Word: this, TF-IDF Score: 0.38408524091481483
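To make the link between these scores and the TF and IDF definitions explicit, the following pure-Python sketch recomputes them by hand and then uses the vectors for the similarity computation mentioned earlier. It is an illustration under stated assumptions, not scikit-learn's implementation: it assumes the documented defaults of `TfidfVectorizer` (lowercasing, tokens of two or more word characters, raw counts as TF, IDF = ln((1 + N) / (1 + n)) + 1, and L2 normalization). The helper names `tokenize`, `tfidf_vectors`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and keep words of two or more characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    """Return one L2-normalized {word: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in token_lists for word in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Raw count (TF) times smoothed IDF.
        weights = {
            word: count * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({word: w / norm for word, w in weights.items()} if norm else {})
    return vectors


def cosine(vec_a, vec_b):
    # The vectors are unit length, so their dot product is the cosine similarity.
    return sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())


documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

vectors = tfidf_vectors(documents)
print("first in doc 1:", vectors[0]["first"])
print("doc 1 vs doc 4:", cosine(vectors[0], vectors[3]))
print("doc 1 vs doc 3:", cosine(vectors[0], vectors[2]))
```

Documents 1 and 4 contain the same bag of words, so their cosine similarity is 1 up to floating-point rounding, whereas documents 1 and 3 share only the common words "this", "is", and "the" and score much lower. This is the kind of comparison a content-based recommender can use to rank similar items.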
