Project link: https://github.com/vi3k6i5/flashtext
Source article: Resources | FlashText, a corpus-cleaning tool that finishes a five-day Regex job in fifteen minutes
Comparison with some other libraries: python | pyahocorasick / ahocorapy, small tools for fast keyword matching and retrieval.
In my experience, pyahocorasick still seems to be the faster one.
Installation:
pip install flashtext
Searching for 15k keywords over a 10k-word corpus takes about 0.165 seconds with regular expressions, but only about 0.002 seconds with Flashtext. On this task, Flashtext is therefore roughly 82 times faster than regex.
As the amount of text to process grows, regex runtime increases roughly linearly, whereas Flashtext's stays nearly constant. This article focuses on the performance difference between regular expressions and Flashtext, describes the Flashtext algorithm and how it works in detail, and presents some benchmarks.
Flashtext is an algorithm based on a trie dictionary data structure and Aho-Corasick. It works by first taking all relevant keywords as input and building a trie dictionary from them, as shown in Figure 3 (see the original article for the diagram):
start and eot are two special characters used to mark word boundaries, just like the regular expressions mentioned above. This trie dictionary is the data structure we later use for searching and replacing.
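The trie idea can be sketched in a few lines of plain Python. This is a toy illustration, not flashtext's actual implementation; `_eot_` stands in for the eot terminal character described above:

```python
# Toy sketch of the trie dictionary (not flashtext's actual implementation).
# '_eot_' marks nodes where a complete keyword ends.
def build_trie(keywords):
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node['_eot_'] = kw  # store the full keyword at its terminal node
    return root

trie = build_trie(['go', 'golang'])
# 'go' is both a complete keyword and a prefix of 'golang':
assert trie['g']['o']['_eot_'] == 'go'
assert trie['g']['o']['l']['a']['n']['g']['_eot_'] == 'golang'
```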
The Flashtext algorithm consists of three main parts, which we will look at separately:
from flashtext import KeywordProcessor

keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keyword(one_kw)  # one_kw: a keyword string to register
keywords_found = keyword_processor.extract_keywords(one_str, span_info=True)  # one_str: the text to search
>>> ('健康', 6, 8)
With span_info=True, each match is returned as a tuple (keyword, start_index, end_index), where the indices delimit the match in the original string.
Of course, there are several more ways to add keywords:
Mapping matched words to a clean name
With add_keyword(word, key), word is grouped under key, much like {'key': 'word'}, so when word is matched, key is returned directly:
keyword_processor.add_keyword('Taj Mahal', {1:1,2:2})
keyword_processor.add_keyword('Delhi', ('Location', 'Delhi'))
keyword_processor.add_keyword('Delhi', ['Location', 'Delhi'])
Dict-style addition
keyword_processor['apple']='fruits'
This dict-style assignment also works, with the same effect as add_keyword(word, key).
Batch addition: from a dict or a list
keyword_dict = {
    "fruit": ["apple", "banana", "orange", "watermelon"],
    "ball": ["tennis", "basketball", "football"]
}
keyword_processor.add_keywords_from_dict(keyword_dict)  # add keywords from a dict
keyword_processor.add_keywords_from_list(["fruit", "banana"])  # add keywords from a list
add_keywords_from_dict behaves like add_keyword(word, key): when any of the values is matched, the corresponding key is returned.
You will usually use extract_keywords, but replace_keywords is also available. extract_keywords returns the matched keywords, while replace_keywords returns the whole sentence with the matches replaced, i.e. keyword locating plus replacement:
# load keywords
kw_list = ['健康', '美味']
keyword_processor = KeywordProcessor()
for kw in kw_list:
    keyword_processor.add_keyword(kw)
keyword_processor.add_keyword('健康', '建康')  # map '健康' to the clean name '建康'
# query
text = "这个菜,真是健康又美味,很健康"
new_sentence = keyword_processor.replace_keywords(text)  # replacement-style query
print(new_sentence)
new_sentence = keyword_processor.extract_keywords(text)  # keyword extraction
print(new_sentence)
>>> 这个菜,真是建康又美味,很建康
>>> ['建康', '美味', '建康']
Removing keywords
keyword_processor.remove_keyword('banana')
keyword_processor.remove_keywords_from_dict({"food": ["bread"]})
keyword_processor.remove_keywords_from_list(["basketball"])
KeywordProcessor is a trie, so it also supports dict-like operations:
len(keyword_processor)  # number of keywords in the trie
'LOVE' in keyword_processor  # check whether the keyword 'Love' is in the vocabulary
keyword_processor.get_keyword('apple')  # dict-style lookup: the clean name stored for 'apple'
keyword_processor.get_all_keywords()  # iterate all keywords at once
References:
Flashtext: a powerful tool for large-scale data cleaning
Learning and using flashtext, a python module for cleaning, searching and matching very-large-scale text data
Keyword extraction
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
>>> # ['New York', 'Bay Area']
Case-sensitive matching
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Bay Area']
Case-insensitive matching (the default, with no clean names)
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
>>> # ['Big Apple', 'Bay Area']
Adding multiple keywords at once
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>> "java": ["java_2e", "java programing"],
>>> "product management": ["PM", "product manager"]
>>> }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management', 'java']
Removing keywords
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
>>>     "java": ["java_2e", "java programing"],
>>>     "product management": ["PM", "product manager"]
>>> }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
>>> # output ['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list/ dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
>>> # output ['product management']
First, keyword matching with pyahocorasick (run in IPython, hence the %time magic):

import ahocorasick

def build_actree(wordlist):
    '''Build the trie for Aho-Corasick keyword matching.'''
    actree = ahocorasick.Automaton()  # initialize the trie
    for index, word in enumerate(wordlist):
        actree.add_word(word, (index, word))  # add each word to the trie
    actree.make_automaton()  # convert the trie into an Aho-Corasick automaton
    return actree

def ac_detect(actree, text):
    '''Match keywords in the text with the Aho-Corasick automaton.'''
    region_wds = []
    for w1 in actree.iter(text):
        if len(w1) > 0:
            region_wds.append(w1[1][1])
    return region_wds

wordlist = ['健康', '减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree, text)

>>> CPU times: user 10 µs, sys: 3 µs, total: 13 µs
>>> Wall time: 17.4 µs
>>> ['减肥', '健康', '减肥', '健康']
Compared with flashtext:
from flashtext import KeywordProcessor

def build_actree(wordlist):
    '''Build the flashtext trie for keyword matching.'''
    actree = KeywordProcessor()
    for index, word in enumerate(wordlist):
        actree.add_keyword(word)  # add each word to the trie
    return actree

def ac_detect(actree, text, span_info=True):
    '''Match keywords in the text.'''
    region_wds = []
    for w1 in actree.extract_keywords(text, span_info=span_info):
        if len(w1) > 0:
            region_wds.append(w1[0])
    return region_wds

wordlist = ['健康', '减肥']
text = '今天你减肥了吗,今天你健康了吗,减肥 = 健康!'
actree = build_actree(wordlist)
%time ac_detect(actree, text)

>>> CPU times: user 41 µs, sys: 0 ns, total: 41 µs
>>> Wall time: 47.2 µs
>>> ['减肥', '健康', '减肥', '健康']