Concept and classification of Chinese word segmentation
Introduction to common segmentation techniques (rule-based, statistical, and hybrid)
Jieba, an open-source Chinese word segmentation tool
Hands-on: high-frequency word extraction
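As a quick preview of the last two topics, here is a minimal sketch combining Jieba with Python's collections.Counter to extract high-frequency words; it assumes jieba is installed (pip install jieba), and the sample sentence and top-N value are placeholders:

import jieba
from collections import Counter

# Placeholder sample text for illustration only.
text = "中文分词是自然语言处理的基础,中文分词的质量直接影响后续任务。"
words = jieba.lcut(text)                   # exact-mode segmentation, returns a list
words = [w for w in words if len(w) > 1]   # drop single characters and punctuation
print(Counter(words).most_common(3))       # three most frequent words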
Rule-based segmentation
The earliest approach: a manually built dictionary is matched against the text according to fixed rules. Simple and efficient to implement, but it handles new (out-of-vocabulary) words poorly.
Statistical segmentation
Handles new-word discovery well, but depends heavily on the quality of the training corpus.
Hybrid segmentation
A combination of rule-based and statistical segmentation.
Definition
A mechanical segmentation method: a dictionary is maintained, and when segmenting a sentence, each candidate string in the sentence is matched against the dictionary entries one by one; if a match is found, the string is split off as a word, otherwise it is not.
Classification
Forward maximum matching — basic idea
Assume the longest word in the segmentation dictionary has $i$ Chinese characters; take the first $i$ characters of the current string in the document as the matching field and look it up in the dictionary.
Algorithm description
Reverse maximum matching — basic principle
Scan and match from the end of the document, each time taking the last $i$ characters as the matching field; if the match fails, drop the first character of the field and match again.
Bidirectional maximum matching — basic principle
Compare the segmentation results produced by forward and reverse maximum matching, then, following the maximum-matching principle, select the result with the fewer segmented words (a sketch combining the two methods appears after the code below).
Related code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @version : 1.0
# @Time    : 2019-8-25 15:27
# @Author  : cunyu
# @Email   : cunyu1024@foxmail.com
# @Site    : https://cunyu1943.github.io
# @File    : mm.py
# @Software: PyCharm
# @Desc    : forward maximum matching segmentation

train_data = './data/train.txt'  # training corpus
test_data = './data/test.txt'  # test corpus
result_data = './data/test_sc_zhengxiang.txt'  # output file


def get_dic(train_data):
    """Read the training corpus and return the word list (dictionary)."""
    with open(train_data, 'r', encoding='utf-8') as f:
        file_content = f.read().split()
    return list(set(file_content))


def MM(test_data, result_data, dic):
    max_length = 5  # maximum word length
    with open(result_data, 'w', encoding='utf-8') as h, \
            open(test_data, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            # forward maximum matching, line by line
            my_list = []
            while len(line) > 0:
                tryWord = line[0:max_length]
                # shrink the matching field from the right until it is in
                # the dictionary or only one character remains
                while tryWord not in dic:
                    if len(tryWord) == 1:
                        break
                    tryWord = tryWord[0:len(tryWord) - 1]
                my_list.append(tryWord)
                line = line[len(tryWord):]
            # write the segmentation result to the output file
            for t in my_list:
                if t == '\n':
                    h.write('\n')
                else:
                    h.write(t + " ")


if __name__ == '__main__':
    print('Loading dictionary')
    dic = get_dic(train_data)
    print('Matching started')
    MM(test_data, result_data, dic)
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @version : 1.0
# @Time    : 2019-8-25 15:36
# @Author  : cunyu
# @Email   : cunyu1024@foxmail.com
# @Site    : https://cunyu1943.github.io
# @File    : rmm.py
# @Software: PyCharm
# @Desc    : reverse maximum matching segmentation

train_data = './data/train.txt'
test_data = './data/test.txt'
result_data = './data/test_sc.txt'


def get_dic(train_data):
    """Read the training corpus and return the word list (dictionary)."""
    with open(train_data, 'r', encoding='utf-8') as f:
        file_content = f.read().split()
    return list(set(file_content))


def RMM(test_data, result_data, dic):
    max_length = 5  # maximum word length
    with open(result_data, 'w', encoding='utf-8') as h, \
            open(test_data, 'r', encoding='utf-8') as f:
        for line in f.readlines():
            # reverse maximum matching, line by line; a stack restores word order
            my_stack = []
            while len(line) > 0:
                tryWord = line[-max_length:]
                # shrink the matching field from the left until it is in
                # the dictionary or only one character remains
                while tryWord not in dic:
                    if len(tryWord) == 1:
                        break
                    tryWord = tryWord[1:]
                my_stack.append(tryWord)
                line = line[0:len(line) - len(tryWord)]
            # pop the stack to write the words in their original order
            while len(my_stack):
                t = my_stack.pop()
                if t == '\n':
                    h.write('\n')
                else:
                    h.write(t + " ")


if __name__ == '__main__':
    print('Loading dictionary')
    dic = get_dic(train_data)
    print('Matching started...')
    RMM(test_data, result_data, dic)
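For completeness, here is a minimal sketch of bidirectional maximum matching built on the two algorithms above. Since MM and RMM write files rather than return lists, this sketch assumes list-returning variants; forward_mm, reverse_mm, and bi_mm are illustrative names, not from the original code:

def forward_mm(line, dic, max_length=5):
    """Forward maximum matching over a single line, returning a word list."""
    words = []
    while line:
        chunk = line[:max_length]
        while chunk not in dic and len(chunk) > 1:
            chunk = chunk[:-1]  # shrink from the right
        words.append(chunk)
        line = line[len(chunk):]
    return words


def reverse_mm(line, dic, max_length=5):
    """Reverse maximum matching over a single line, returning a word list."""
    words = []
    while line:
        chunk = line[-max_length:]
        while chunk not in dic and len(chunk) > 1:
            chunk = chunk[1:]  # shrink from the left
        words.append(chunk)
        line = line[:len(line) - len(chunk)]
    words.reverse()  # restore original word order
    return words


def bi_mm(line, dic, max_length=5):
    """Bidirectional maximum matching: prefer the result with fewer words."""
    fwd = forward_mm(line, dic, max_length)
    rev = reverse_mm(line, dic, max_length)
    # On a tie, keep the reverse result (a common heuristic for Chinese);
    # the notes above only specify choosing the fewer-word segmentation.
    return fwd if len(fwd) < len(rev) else rev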
Statistical segmentation — main operations
n-gram conditional probability
$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \dfrac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}$
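A minimal sketch of this maximum-likelihood estimate for bigrams (n = 2); the toy corpus of pre-segmented sentences is illustrative only:

from collections import Counter

# Toy corpus: each sentence is already segmented into words.
corpus = [["我", "爱", "北京"], ["我", "爱", "自然", "语言", "处理"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(tuple(sent[i:i + 2])
                  for sent in corpus for i in range(len(sent) - 1))


def p_bigram(prev, word):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0


print(p_bigram("我", "爱"))  # count(我, 爱) = 2, count(我) = 2 -> 1.0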