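For better segmentation results, you can build your own word dictionary and add it to jieba's dictionary:
jieba.load_userdict()
The argument inside the parentheses is the path and file name of the dictionary.
Here is a small segmentation function: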
    import jieba

    # Load the custom dictionary once, before any segmentation is done.
    jieba.load_userdict("C:/Users/idmin/Desktop/dict.txt")

    def preprocess(path):
        # Read the whole file as UTF-8 text.
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        text = ""
        # jieba.cut yields the words one by one; put "/" in front of each.
        for word in jieba.cut(content):
            text = text + "/" + word
        return text

    print(preprocess('C:/Users/idmin/Desktop/one.txt'))

    '''
    # Or use the following program
    import jieba

    jieba.load_userdict("C:/Users/idmin/Desktop/dict.txt")

    def preprocess(path):
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        # Join the words with "/" (no leading "/" in this version).
        return "/".join(jieba.cut(content))

    print(preprocess('C:/Users/idmin/Desktop/one.txt'))
    '''
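The content of one.txt is:
The content of the dictionary file dict.txt is:
The dictionary file must be encoded as UTF-8. (Saving it again with "Save As" and choosing UTF-8 is enough.)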
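If you would rather not go through "Save As", the same re-encoding can be done from Python. This is only a sketch: the helper name `convert_to_utf8` is made up for illustration, and the source encoding `gbk` is an assumption about how the file was originally saved (adjust it to whatever your editor used). The demo works on temporary files so no real file is touched.

```python
import os
import tempfile


def convert_to_utf8(src_path, dst_path, src_encoding="gbk"):
    """Re-save a text file as UTF-8.

    src_encoding is an assumption about the original file; "gbk" is
    only a guess and must match how the file was really saved.
    """
    with open(src_path, "r", encoding=src_encoding) as src:
        content = src.read()
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(content)


# Demo on throwaway files: write a GBK file, convert it, read it back as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    gbk_path = os.path.join(tmp, "dict_gbk.txt")
    utf8_path = os.path.join(tmp, "dict.txt")
    with open(gbk_path, "w", encoding="gbk") as f:
        f.write("快速入门\n")
    convert_to_utf8(gbk_path, utf8_path)
    with open(utf8_path, "r", encoding="utf-8") as f:
        converted = f.read()
```

If the source encoding is guessed wrong, `open(...).read()` will usually raise `UnicodeDecodeError` or produce garbled words, so it is worth opening the converted file once to check it.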
Before adding the dictionary, the segmentation result is:
/你好/您好/python/中/jieba/分词/快速/入门/落叶/数据挖掘/新浪/博客
After adding it, the result is:
/你好/您好/python/中/jieba/分词/快速入门/落叶/数据挖掘/新浪/博客
Note that "快速入门" is no longer split apart.