
Writing a TF-IDF Model

Author: 菜鸟追梦旅行 | 2024-04-04 09:40:48


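I did this small project together with a friend from Guangdong University of Technology (广工). He supplied the data and the theory behind the model; my math foundation is weak, so I only took care of the code implementation.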

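Model theory
The basic theory behind this model is actually not hard.
Suppose I have a thousand papers. After data cleaning, word segmentation, and similar steps, I end up with keyword data.
From these keywords we then pick out the ones that matter most. Common sense says that the more often a word appears, the more important it is. But frequency alone is not enough: a word can rack up a high count simply because it shows up in nearly every paper, or because a single paper repeats it endlessly, and neither tells us much about how distinctive it is. So we bring in a second quantity, log(total number of articles / number of articles containing the word), which is close to zero for words found in almost every article and grows as the word becomes rarer across articles. I have no idea how whoever invented this model came up with it; I guess that's what genius looks like. Finally we multiply the term frequency by that quantity to get the value we need, the TF-IDF score, and this score tells us which keywords are the most important.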
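To make the formula concrete, here is a minimal sketch on a made-up toy corpus. The three tiny "articles" and the helper name `tf_idf_score` are assumptions for illustration only, not part of the original project, and the sketch uses the plain log(N / df) form described above with term frequency taken inside one article.

```python
import math

# Hypothetical toy corpus: each inner list is one already-segmented "article".
docs = [
    ["model", "data", "tfidf", "tfidf"],
    ["model", "data", "keyword"],
    ["model", "text", "data"],
]


def tf_idf_score(word, doc, docs):
    """TF-IDF of `word` in `doc`, using idf = log(N / df) with no smoothing."""
    tf = doc.count(word) / len(doc)          # term frequency inside this article
    df = sum(1 for d in docs if word in d)   # number of articles containing the word
    if df == 0:
        return 0.0                           # word never occurs: no score
    return tf * math.log(len(docs) / df)


# "model" occurs in every article, so its idf is log(3 / 3) = 0.
model_score = tf_idf_score("model", docs[0], docs)
# "tfidf" occurs only in the first article, so its idf is log(3 / 1).
rare_score = tf_idf_score("tfidf", docs[0], docs)

print("model:", model_score)
print("tfidf:", rare_score)
```

In this toy data the word that appears everywhere gets a score of zero, while the word confined to one article gets a positive score, which is the behavior the log term is meant to provide.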
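Code experiment
Right, let's go straight to the code.
First, a note: all the data has already been cleaned and segmented, and it all lives in a single txt file. You might ask: if everything is in one file, how do we compute that second quantity? Simple: we split the one big list into a number of smaller lists to simulate having that many papers. Since this is only an experiment, some parts are inevitably not rigorous.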
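The splitting idea can be sketched on its own with a few made-up words. The helper name `split_into_chunks`, the sample data, and the chunk size of 3 are all assumptions chosen for illustration.

```python
def split_into_chunks(words, size):
    """Cut a flat word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


# Made-up sample data standing in for the contents of the segmented text file.
words = ["a", "b", "c", "d", "e", "f", "g"]
chunks = split_into_chunks(words, 3)
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Each chunk then plays the role of one "paper" when counting how many papers contain a given word. Stepping the range up to `len(words)` rather than `len(words) + 1` avoids an extra empty chunk when the length is an exact multiple of the chunk size.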


```python
from collections import Counter

import math
import operator

def read_txt(path, encod):
    """
    Open a txt file for reading.
    path: path to the txt file
    encod: file encoding
    """
    file = open(path, 'r', encoding=encod)
    return file

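def txtToList(file):
    """Convert an already-segmented, comma-separated text file into a list of words."""
    words = []  # avoid shadowing the built-in name `list`
    for line in file:
        for word in line.split(','):
            word = word.strip()  # drop newlines and surrounding whitespace
            if word:  # skip empty strings left by trailing commas or blank lines
                words.append(word)
    num = len(words)
    print("Number of words in the text: ", num)
    return words, num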

def listToTxt(put_list, fileName):
    """Write the items of a list to a txt file, separated by commas."""
    # The with-block closes the file even if a write fails
    with open(fileName, "w", encoding="utf-8") as f:
        for list_mem in put_list:
            f.write(list_mem + ",")


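def arr_size(arr, size):
    """Split a list into consecutive chunks of `size` items (the last chunk may be shorter)."""
    s = []
    # Stop at len(arr), not len(arr) + 1, so no empty trailing chunk is appended
    for i in range(0, len(arr), size):
        c = arr[i:i + size]
        s.append(c)
    return s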


# Count how many times each word occurs
def func_counter(word_list):
    count_result = Counter(word_list)
    return count_result


# Look up the number of occurrences of a given keyword
def fidc_wf(word, wfs_Dict):
    word_num = wfs_Dict[word]
    return word_num


# Check whether the keyword appears in this article
def findw_article(word, article):
    return word in article


# Count the number of documents that contain the word
def wordinfilecount(i_word, put_lists):
    count = 0  # counter
    for train_list in put_lists:
        if findw_article(i_word, train_list):
            count = count + 1
    # print("The keyword appears in " + str(count) + " articles")
    return count


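# Compute TF-IDF and return (word, score) pairs sorted from highest to lowest
def tf_idf(dataList, putLists, num):
    dic = func_counter(dataList)  # number of occurrences of each keyword
    outdic = {}  # separate dict, so the counts in dic are not overwritten
    for word in dic.keys():
        tf = fidc_wf(word, dic) / num  # term frequency over the whole text
        # idf with +1 in the denominator; negative when the word is in every chunk
        idf = math.log(len(putLists) / (wordinfilecount(word, putLists) + 1))
        tfidf = tf * idf  # combined score
        outdic[word] = tfidf  # store the key-value pair
    orderdic = sorted(outdic.items(), key=operator.itemgetter(1), reverse=True)  # sort by score
    return orderdic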



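# Read the file
file_path = "your_file_path.txt"  # placeholder: replace with the path to your own file
with read_txt(file_path, 'utf-8') as file:
    # Convert the file into a list of words
    word_list, num = txtToList(file)
# Split the list into chunks of 1000 words each
put_lists = arr_size(word_list, 1000)

# Call the main function
print(tf_idf(word_list, put_lists, num))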
```
