赞
踩
在我们平时使用Word或者其他文字编辑软件的时候,常常会遇到单词纠错的功能。比如在Word中:
首先,我们需要一个语料库,基本上所有的NLP任务都会有语料库。单词纠错的语料库为bit.txt,里面包含的内容如下:
Gutenberg语料库数据;
维基词典;
英国国家语料库中的最常用单词列表。
下载的网址为:https://github.com/percent4/-word- 。
实现单词纠错的完整Python代码(spelling_correcter.py)如下:
- # -*- coding: utf-8 -*-
- import re, collections
-
- def tokens(text):
- """
- Get all words from the corpus
- """
- return re.findall('[a-z]+', text.lower())
-
- with open('E://big.txt', 'r') as f:
- WORDS = tokens(f.read())
- WORD_COUNTS = collections.Counter(WORDS)
-
- def known(words):
- """
- Return the subset of words that are actually
- in our WORD_COUNTS dictionary.
- """
- return {w for w in words if w in WORD_COUNTS}
-
-
- def edits0(word):
- """
- Return all strings that are zero edits away
- from the input word (i.e., the word itself).
- """
- return {word}
-
-
- def edits1(word):
- """
- Return all strings that are one edit away
- from the input word.
- """
- alphabet = 'abcdefghijklmnopqrstuvwxyz'
-
- def splits(word):
- """
- Return a list of all possible (first, rest) pairs
- that the input word is made of.
- """
- return [(word[:i], word[i:]) for i in range(len(word) + 1)]
-
- pairs = splits(word)
- deletes = [a + b[1:] for (a, b) in pairs if b]
- transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
- replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]
- inserts = [a + c + b for (a, b) in pairs for c in alphabet]
- return set(deletes + transposes + replaces + inserts)
-
-
- def edits2(word):
- """
- Return all strings that are two edits away
- from the input word.
- """
- return {e2 for e1 in edits1(word) for e2 in edits1(e1)}
-
-
- def correct(word):
- """
- Get the best correct spelling for the input word
- """
- # Priority is for edit distance 0, then 1, then 2
- # else defaults to the input word itself.
- candidates = (known(edits0(word)) or
- known(edits1(word)) or
- known(edits2(word)) or
- [word])
- return max(candidates, key=WORD_COUNTS.get)
-
-
- def correct_match(match):
- """
- Spell-correct word in match,
- and preserve proper upper/lower/title case.
- """
-
- word = match.group()
-
- def case_of(text):
- """
- Return the case-function appropriate
- for text: upper, lower, title, or just str.:
- """
- return (str.upper if text.isupper() else
- str.lower if text.islower() else
- str.title if text.istitle() else
- str)
-
- return case_of(word)(correct(word.lower()))
-
-
- def correct_text_generic(text):
- """
- Correct all the words within a text,
- returning the corrected text.
- """
- return re.sub('[a-zA-Z]+', correct_match, text)

有了上述的单词纠错程序,接下来我们对一些单词或句子做测试。如下:
- original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta', \
- 'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa', \
- 'phons forteen Doora. This is from Chinab.']
-
- for original_word in original_word_list:
- correct_word = correct_text_generic(original_word)
- print('Orginial word: %s\nCorrect word: %s'%(original_word, correct_word))
输出结果如下:
Orginial word: fianlly
接着,我们对如下的Word文档(Spelling Error.docx)进行测试(下载地址为:https://github.com/percent4/-word-),
对该文档进行单词纠错的Python代码如下:
- from docx import Document
- from nltk import sent_tokenize, word_tokenize
- from spelling_correcter import correct_text_generic
- from docx.shared import RGBColor
-
- # 文档中修改的单词个数
- COUNT_CORRECT = 0
- #获取文档对象
- file = Document("E://Spelling Error.docx")
-
- #print("段落数:"+str(len(file.paragraphs)))
-
- punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"
-
- document = Document() # word文档句柄
-
- def write_correct_paragraph(i):
- global COUNT_CORRECT
-
- # 每一段的内容
- paragraph = file.paragraphs[i].text.strip()
- # 进行句子划分
- sentences = sent_tokenize(text=paragraph)
- # 词语划分
- words_list = [word_tokenize(sentence) for sentence in sentences]
-
- p = document.add_paragraph(' '*7) # 段落句柄
-
- for word_list in words_list:
- for word in word_list:
- # 每一句话第一个单词的第一个字母大写,并空两格
- if word_list.index(word) == 0 and words_list.index(word_list) == 0:
- if word not in punkt_list:
- p.add_run(' ')
- # 修改单词,如果单词正确,则返回原单词
- correct_word = correct_text_generic(word)
- # 如果该单词有修改,则颜色为红色
- if correct_word != word:
- colored_word = p.add_run(correct_word[0].upper()+correct_word[1:])
- font = colored_word.font
- font.color.rgb = RGBColor(0x00, 0x00, 0xFF)
- COUNT_CORRECT += 1
- else:
- p.add_run(correct_word[0].upper() + correct_word[1:])
- else:
- p.add_run(word)
- else:
- p.add_run(' ')
- # 修改单词,如果单词正确,则返回原单词
- correct_word = correct_text_generic(word)
- if word not in punkt_list:
- # 如果该单词有修改,则颜色为红色
- if correct_word != word:
- colored_word = p.add_run(correct_word)
- font = colored_word.font
- font.color.rgb = RGBColor(0xFF, 0x00, 0x00)
- COUNT_CORRECT += 1
- else:
- p.add_run(correct_word)
- else:
- p.add_run(word)
-
- for i in range(len(file.paragraphs)):
- write_correct_paragraph(i)
-
- document.save('E://correct_document.docx')
-
- print('修改并保存文件完毕!')
- print('一共修改了%d处。'%COUNT_CORRECT)

输出的结果如下:
修改并保存文件完毕!
修改后的Word文档如下:
其中的红色字体部分为原先的单词有拼写错误,进行拼写纠错后的单词,一共修改了19处。
单词纠错实现起来并没有想象中的那么难,但也不是那么容易~https://github.com/percent4/-word- 。
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。