Using the jieba Library in Python

Author: 盐析白兔 | 2024-03-31 08:55:48


1. The jieba library

jieba is a widely used third-party Chinese word segmentation library for Python.
Commonly used jieba segmentation functions:

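Function	Description
jieba.cut(s)	Precise mode; returns an iterable (a generator) of words
jieba.cut(s, cut_all=True)	Full mode; yields every possible word found in the text s
jieba.cut_for_search(s)	Search-engine mode; produces segments suited to building a search-engine index
jieba.lcut(s)	Precise mode; returns a list
jieba.lcut(s, cut_all=True)	Full mode; returns a list
jieba.lcut_for_search(s)	Search-engine mode; returns a list
jieba.add_word(w)	Adds the new word w to the segmentation dictionary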

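Example: counting word frequencies in a text
IPO (input, process, output) description of the problem:
Input: read an article from a file
Process: count how often each word occurs, using a dictionary
Output: the 10 most frequent words in the article and their counts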
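
The IPO steps above can be sketched in a few lines with collections.Counter from the standard library. This is an illustrative sketch rather than the approach used in the examples in this post: the function name top_words, the regular expression, and the in-memory sample text (standing in for a file) are all assumptions.

```python
import re
from collections import Counter


def top_words(text, n=10, stopwords=frozenset()):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Input: lower-case the text and pull out runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Process: count every word that is not a stop word.
    counts = Counter(w for w in words if w not in stopwords)
    # Output: most_common sorts by count, descending.
    return counts.most_common(n)


if __name__ == "__main__":
    sample = "To be, or not to be, that is the question."  # stands in for file contents
    for word, count in top_words(sample, n=3, stopwords={"the", "or"}):
        print("{0:<10}{1:>5}".format(word, count))
```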

Word frequency in an English text

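# Stop words
excludes = {"the", "and", "to", "of", "you", "a", "i", "my", "in"}

# Read the text
def getText():
    with open("hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # str.lower() returns a new string, so reassign it
    for ch in '!@#$%^&*()_+-={}[]\\|:"<>?;,./':
        txt = txt.replace(ch, " ")  # replace punctuation with spaces
    return txt

hamletTxt = getText()
# Split into words
words = hamletTxt.split()
# Count word frequencies
counts = {}  # dict mapping each word to its count
for word in words:
    counts[word] = counts.get(word, 0) + 1
# Remove stop words
for word in excludes:
    counts.pop(word, None)  # no KeyError if the word never occurs
# Sort
items = list(counts.items())  # convert to a list so it can be sorted
# list.sort() sorts the list in place.
# key (optional): a function that returns the value to sort by.
# reverse (optional): reverse=True sorts in descending order.
items.sort(key=lambda x: x[1], reverse=True)
# Output the 10 most frequent words
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))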

Word frequency in a Chinese text

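# Import the jieba library
import jieba

# Stop words
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "如何", "商议", "主公"}
# Read the text
with open("三国演义.txt", "r", encoding="utf-8") as f:
    txt = f.read()
# Segment into words
words = jieba.lcut(txt)
# Count word frequencies
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"  # merge different names for the same person
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word  # keep every other word unchanged
    counts[rword] = counts.get(rword, 0) + 1
# Remove stop words
for word in excludes:
    counts.pop(word, None)  # no KeyError if the word never occurs
# Sort by count, descending
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Output the 5 most frequent words
for word, count in items[:5]:
    # chr(12288) is the full-width space, used as the fill character to align Chinese text
    print("{0:{2}<10}{1:{2}>5}".format(word, count, chr(12288)))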
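
The elif chain that merges aliases in the example above grows with every new name. One possible alternative, sketched here, keeps the aliases in a dictionary and looks each token up with dict.get. The names ALIASES and count_tokens are hypothetical, and the function takes an already-segmented token list (such as the result of jieba.lcut) so that it runs without the novel's text file.

```python
# Alias -> canonical name; the same pairs as the elif chain in the example above.
ALIASES = {
    "诸葛亮": "孔明", "孔明曰": "孔明",
    "关公": "关羽", "云长": "关羽",
    "玄德": "刘备", "玄德曰": "刘备",
    "孟德": "曹操", "丞相": "曹操",
}


def count_tokens(tokens, aliases=ALIASES, excludes=frozenset()):
    """Count multi-character tokens, merging aliases and skipping stop words."""
    counts = {}
    for token in tokens:
        if len(token) == 1:  # skip single-character tokens
            continue
        name = aliases.get(token, token)  # fall back to the token itself
        if name in excludes:
            continue
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    tokens = ["玄德", "曰", "孔明", "诸葛亮", "丞相", "将军"]  # hand-made sample
    print(count_tokens(tokens, excludes={"将军"}))
```

Filtering stop words inside the loop also removes the need for a separate deletion pass over the dictionary.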
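
The chr(12288) passed to format in the Chinese example is the full-width (ideographic) space, U+3000. Padding Chinese words with it helps columns line up, because an ASCII space is typically narrower than a Chinese character. A small sketch follows. format_row is a hypothetical helper, and it pads the numeric column with ordinary spaces because digits are half-width, which differs slightly from the original print call.

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000, ideographic space


def format_row(word, count, width=10):
    """Left-align word with full-width padding, right-align count in 5 columns."""
    # {2} supplies the fill character and {3} the field width.
    return "{0:{2}<{3}}{1:>5}".format(word, count, FULLWIDTH_SPACE, width)


if __name__ == "__main__":
    for word, count in [("曹操", 3), ("诸葛孔明", 2)]:  # made-up counts
        print(format_row(word, count))
```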
