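Spelling Correction
When working in Word or other text editors, we often come across a spelling-correction feature. For example, in Word:
[Figure: a misspelled word]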
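The spelling-correction algorithm
First, we need a corpus; almost every NLP task relies on one. The corpus used for spelling correction here is big.txt, which contains:
data from the Gutenberg corpus;
Wiktionary;
the list of the most frequently used words from the British National Corpus.
It can be downloaded from https://github.com/percent4/-word- .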
Python implementation
The complete Python code for spelling correction (spelling_correcter.py) is as follows:
# -*- coding: utf-8 -*-
import collections
import re


def tokens(text):
    """
    Get all words from the corpus
    """
    return re.findall(r'[a-z]+', text.lower())


with open('E://big.txt', 'r') as f:
    WORDS = tokens(f.read())

WORD_COUNTS = collections.Counter(WORDS)


def known(words):
    """
    Return the subset of words that are actually
    in our WORD_COUNTS dictionary.
    """
    return {w for w in words if w in WORD_COUNTS}


def edits0(word):
    """
    Return all strings that are zero edits away
    from the input word (i.e., the word itself).
    """
    return {word}


def edits1(word):
    """
    Return all strings that are one edit away
    """
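The listing ends inside the docstring of `edits1`, so its body is not shown. As a minimal sketch of what a one-edit candidate generator could look like, the block below assumes the four usual single-character operations (deletion, transposition of adjacent letters, replacement, insertion) over the lowercase alphabet. The body of `edits1`, the `SAMPLE_COUNTS` counter, and the `known_sample` helper are illustrative assumptions, not the original code. `SAMPLE_COUNTS` is a tiny stand-in for the counts built from big.txt, so the example runs without the corpus file.

```python
import collections
import string


def edits1(word):
    """Return every string that is one edit away from `word` (sketch)."""
    letters = string.ascii_lowercase
    # Every way to split the word into a (left, right) pair.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Drop one character.
    deletes = [a + b[1:] for a, b in splits if b]
    # Swap two adjacent characters.
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    # Replace one character with any letter.
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    # Insert any letter at any position.
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


# Hypothetical stand-in for WORD_COUNTS; the numbers are made up.
SAMPLE_COUNTS = collections.Counter({"the": 100, "spelling": 5, "spewing": 1})


def known_sample(words):
    """Keep only the candidates that occur in SAMPLE_COUNTS."""
    return {w for w in words if w in SAMPLE_COUNTS}


candidates = known_sample(edits1("speling"))
# Pick the candidate with the highest count in the sample.
best = max(candidates, key=SAMPLE_COUNTS.get)
print(sorted(candidates))  # ['spelling', 'spewing']
print(best)                # spelling
```

Filtering the generated strings through the known-word set is what keeps the candidate list small: most of the strings one edit away from a word are not real words, and choosing the most frequent survivor is one simple way to rank what remains.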