Natural Language Processing: LDA Modeling Code (with a pkl Stop-Word List)

Author: 盐析白兔 | 2024-06-14 22:46:02

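This is my first time using the markdown editor, so here goes.

This post records how I used LDA, a natural language processing method, in an earlier project. I hope it helps.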

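Contents

1. Overview of building the LDA model
2. Extracting the training text
3. Segmenting the text and filtering it with a stop-word list
4. Training the LDA model
5. Evaluating the model
6. Other operations
Packing the stop-word list into pkl format
Getting the topic of each text from the LDA model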

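1. Overview of building the LDA model

Overall, the work breaks down into the following steps:

Extract the training text
Segment the text and filter it with a stop-word list
Train the LDA model (this depends on the previous two steps)
Evaluate the model

Operations and code that are not directly part of LDA modeling are collected in the "Other operations" section at the end.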

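2. Extracting the training text

The core code for this step is:

file_path = 'raw_text.txt'  # placeholder: path to your raw text file
write_file_name = '训练文本.txt'  # the extracted training text is saved here

with open(file_path) as src, open(write_file_name, 'a') as dst:
    for line in src:  # loop over every line of the raw text
        # Preprocess the line to suit your data, so that content ends up
        # holding only the sentence you want to train on.
        content = line.strip()
        if content:  # skip lines that are empty after preprocessing
            dst.write(content + '\n')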
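The preprocessing itself is left open above because it depends on the data. As one possible illustration, the sketch below strips URLs and keeps only Chinese characters, ASCII letters and digits. The function name `clean_line` and the specific rules are assumptions made for this example, not part of the original project.

```python
import re

# Hypothetical cleaning rules, chosen only for illustration.
URL_PATTERN = re.compile(r'https?://\S+')
# Anything that is not a CJK character, an ASCII letter or a digit.
NOISE_PATTERN = re.compile(r'[^\u4e00-\u9fffA-Za-z0-9]+')

def clean_line(line):
    """Return the line with URLs removed and runs of noise collapsed to one space."""
    line = URL_PATTERN.sub(' ', line)
    line = NOISE_PATTERN.sub(' ', line)
    return line.strip()

if __name__ == '__main__':
    print(clean_line('今天天气不错！！ 详情见 https://example.com/a?b=1 \r\n'))
```

Note that the URL pattern consumes everything up to the next whitespace, so text glued directly to a URL would be removed with it. Adjust the rules to your own corpus.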

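3. Segmenting the text and filtering it with a stop-word list

The stop-word list we use is in pkl format. For converting a txt stop-word list to pkl, see "Packing the stop-word list into pkl format" under "Other operations".

This is the core segmentation code:

import pickle
import jieba  # pip install jieba

def drop_stopwords(line_contents, stopwords):
    """Drop words found in the stop-word list or the custom filter list; return the meaningful words."""
    line_clean = []
    custom_remove_list = ['自定义过滤的词']  # placeholder: put your own words to filter here
    for word in line_contents:
        word = word.strip()
        if len(word) < 2:  # skips empty strings and single characters
            continue
        if (word in stopwords) or (word in custom_remove_list):
            continue
        line_clean.append(word)
    return line_clean

def get_seg_content(content, stopwords_file=''):
    """Segment content with jieba, remove stop words, and return a list of words."""
    # A different stop-word list is sometimes needed, so a pkl path can be passed in;
    # otherwise the default stopwords.pkl is used.
    if not stopwords_file:
        stopwords_file = 'stopwords.pkl'
    with open(stopwords_file, 'rb') as f:
        stopwords = set(pickle.load(f))
    segment = jieba.lcut(str(content).strip())
    if not segment:  # empty or whitespace-only line
        return []
    return drop_stopwords(segment, stopwords)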
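`get_seg_content` is called once per line of text, so unpickling the stop-word list on every call adds up on a large corpus. One way around this, sketched here under the assumption that the list does not change during a run, is to cache the loaded words with `functools.lru_cache`. `load_stopwords` is a hypothetical helper name.

```python
import pickle
from functools import lru_cache

@lru_cache(maxsize=None)
def load_stopwords(stopwords_file='stopwords.pkl'):
    """Load a pickled stop-word list once per path and return it as a frozenset."""
    with open(stopwords_file, 'rb') as f:
        # A frozenset gives fast membership tests and cannot be mutated by callers.
        return frozenset(pickle.load(f))
```

Inside `get_seg_content`, the `pickle.load` call could then be replaced by `load_stopwords(path)`. As with any pickle file, only load lists that you created yourself.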

4. Training the LDA model

This is the code that trains the LDA model:

import gensim  # pip install gensim
from gensim import corpora

def get_contents_clean(all_contents):
    """Segment every line of the training text and return a list of word lists."""
    contents_clean = []
    for content in all_contents:
        clean_content = get_seg_content(content)
        if clean_content:
            contents_clean.append(clean_content)
    return contents_clean

def lda_analyze(all_contents, num_topic=10):
    """Core routine that trains the LDA model."""
    contents_clean = get_contents_clean(all_contents)
    # contents_clean is a list of lists of words
    dictionary = corpora.Dictionary(contents_clean)
    corpus = [dictionary.doc2bow(sentence) for sentence in contents_clean]
    lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topic)  # the key call
    return lda

def get_topic(num_topic=10):
    # num_topic is the number of topics the LDA model is trained with
    try:
        # read the training text, one document per line
        with open('训练文本.txt') as f:
            all_contents = list(f)
        lda = lda_analyze(all_contents, num_topic=num_topic)
        for topic in lda.print_topics(num_words=20):  # print the top words of each topic
            print(topic[1])
        # save the model
        lda.save('lda_' + str(num_topic) + '.model')
    except Exception as e:
        print(e)
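For readers unfamiliar with the bag-of-words format that `doc2bow` produces, here is a plain-Python sketch of the idea: each document becomes a list of `(word_id, count)` pairs. This is a simplified stand-in written for illustration, not gensim's implementation, and the names `build_vocab` and `to_bow` are made up for this example.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct word, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert one tokenized document into sorted (word_id, count) pairs.

    Words missing from the vocabulary are ignored.
    """
    counts = Counter(vocab[w] for w in doc if w in vocab)
    return sorted(counts.items())

if __name__ == '__main__':
    docs = [['苹果', '手机', '苹果'], ['手机', '屏幕']]
    vocab = build_vocab(docs)
    print(vocab)
    print([to_bow(d, vocab) for d in docs])
```

The LDA model only ever sees these id and count pairs, which is why the same dictionary has to be used for training and for any later text.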

When we start training we do not know the best value of num_topic, so we can first try a range of values on a server:

for i in range(16):
    get_topic(i + 1)  # train with 1 to 16 topics and save every model

This is a little clumsy but easy to follow. Once you understand the code, you can adapt it together with the model evaluation part below.

5. Evaluating the model

First, load the model:

import gensim

def get_lda_model(model_path):
    """Load a saved LDA model from model_path."""
    model = gensim.models.ldamodel.LdaModel.load(model_path)
    return model

Then evaluate it:

import gensim
from gensim.models.coherencemodel import CoherenceModel

lda_model = get_lda_model('lda_10.model')  # load an LDA model, here the one saved by get_topic(10)
# Much of the code below should look familiar
with open('训练文本.txt') as f:
    all_contents = list(f)
contents_clean = get_contents_clean(all_contents)
dictionary = gensim.corpora.Dictionary(contents_clean)
# Judge the model by its topic coherence
cm = CoherenceModel(model=lda_model, texts=contents_clean, dictionary=dictionary, coherence='c_v')
print(cm.get_coherence())
# Compare the value printed for several models; the larger, the better
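To compare the models saved by the loop over num_topic, the coherence score can be computed for each one and the highest kept. The sketch below assumes the models were saved as `lda_1.model` through `lda_16.model` by `get_topic`, and that `contents_clean` and `dictionary` are built as in the evaluation code above. `score_saved_models` and `best_num_topic` are hypothetical helper names.

```python
def score_saved_models(contents_clean, dictionary, max_topics=16):
    """Return {num_topic: c_v coherence} for models saved as lda_<n>.model."""
    # Imported here so the helper below can be used without gensim installed.
    from gensim.models.coherencemodel import CoherenceModel
    from gensim.models.ldamodel import LdaModel

    scores = {}
    for n in range(1, max_topics + 1):
        model = LdaModel.load('lda_' + str(n) + '.model')
        cm = CoherenceModel(model=model, texts=contents_clean,
                            dictionary=dictionary, coherence='c_v')
        scores[n] = cm.get_coherence()
    return scores

def best_num_topic(scores):
    """Pick the topic count with the highest coherence; ties go to fewer topics."""
    return max(sorted(scores), key=lambda n: scores[n])

if __name__ == '__main__':
    # Made-up scores, only to show how the selection works.
    example_scores = {2: 0.41, 4: 0.52, 6: 0.52, 8: 0.47}
    print(best_num_topic(example_scores))
```

A coherence score is only a guide. It is still worth reading the top words of the leading candidates before settling on a topic count.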

6. Other operations

Packing the stop-word list into pkl format

import pickle

def save_stop_words_list(stopwords_file):
    """Read a txt stop-word list (one word per line) and save it as stopwords.pkl."""
    stword = []
    with open(stopwords_file) as f:
        for words in f:
            word = words.strip()  # drop tabs, spaces and line endings
            if word:  # skip blank lines
                stword.append(word)
    with open('stopwords.pkl', 'wb') as output:
        pickle.dump(stword, output)

To use it, just call:

save_stop_words_list('stopwords.txt')  # pass in the txt stop-word list; stopwords.pkl is written to the current directory

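Getting the topic of each text from the LDA model

def get_topic_list(lda_model, content):
    # lda_model: a trained LDA model
    # content: any piece of text
    content = get_seg_content(content)
    # Use the dictionary stored in the model so word ids match those used in training;
    # a dictionary built from this single text would assign different ids.
    corpus = lda_model.id2word.doc2bow(content)  # convert the document to bag-of-words
    topic_list = lda_model.get_document_topics(corpus)  # topic distribution of the new document
    return topic_list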

The returned topic_list gives the topics this text belongs to and how strongly, in the form: [(topic m, probability), (topic n, probability), …]
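Since topic_list is a list of `(topic_id, probability)` pairs, the single most likely topic for a text can be picked with `max`. A small sketch with made-up numbers; `dominant_topic` is a hypothetical helper name.

```python
def dominant_topic(topic_list):
    """Return the (topic_id, probability) pair with the highest probability,
    or None when the list is empty."""
    if not topic_list:
        return None
    return max(topic_list, key=lambda pair: pair[1])

if __name__ == '__main__':
    # Made-up distribution in the format returned by get_topic_list
    example = [(0, 0.12), (3, 0.71), (7, 0.17)]
    print(dominant_topic(example))
```

gensim leaves out topics whose probability falls below its `minimum_probability` threshold, so the list does not always contain every topic.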
