Mother's Day
There is no doubt that mother is the greatest person in the world. She gives me life and takes care of me
all the time. It is natural to have a day to show the great respect to our mothers, so people named a day
as Mother's Day in the second Sunday of May. At this meaningful day, people find some specila ways to
express gratitude.
Mother's Day gives people a chance to express their gratityde to their mothers and it is important to do
so. My parents are traditional and they don't often say words about love, but the things they do for me are
never less. When I went to middle school, my classmates sent messagses to their mothers or just gave a call
on Mother's Day, until then I realized I must do something for my mother. So I borrowed my friend's phone
and sent a message to my mom, telling her that I loved her. Later, my father told me that mom was really happy.
Every small act we do for our mothers will touch them deeply. If we can’t make her happy every day, at least
let her know how much we love her.





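This passage echoes the article's title and states the main point... or rather, it fails to, because the article has no clear main point to begin with.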






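For the naive Bayes formula, see the more detailed blog post 朴素贝叶斯算法及其实战 (Naive Bayes Algorithm in Practice). I won't go into detail here and will only skim over it, because, as Mr. Lu Xun said: every extra formula in an article costs it one reader.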





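P(A): the probability that the correctly spelled word A appears in the text. The program represents it directly by word frequency, i.e. (number of occurrences of word A / total number of words).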
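As a rough illustration of this definition, the sketch below estimates P(A) from a tiny in-memory sample. The function name `word_probability` and the sample sentence are assumptions made for the example; the article's own program takes its counts from big.txt instead.

```python
import re
from collections import Counter


def word_probability(word, counts):
    """Estimate P(word) as count(word) / total number of words."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical sample text standing in for a real corpus.
sample = "the cat sat on the mat and the dog sat too"
counts = Counter(re.findall(r"[A-Za-z]+", sample))

print(word_probability("the", counts))  # 3 occurrences out of 11 words
print(word_probability("cow", counts))  # unseen word: 0.0
```

With a raw frequency like this, a word that never appears in the corpus gets probability 0.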


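Technology:
    P(Technology | cinema, Alipay, cloud computing) ∝ P(cinema, Alipay, cloud computing | Technology) * P(Technology)
        = 8/100 * 20/100 * 63/100 * 30/90
        = 126/37500
        = 0.00336
Entertainment:
    P(Entertainment | cinema, Alipay, cloud computing) ∝ P(cinema, Alipay, cloud computing | Entertainment) * P(Entertainment)
        = 56/232 * 25/122 * 0/121 * 60/90
        = 0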
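The arithmetic in this example can be checked with exact fractions. This is only a sketch: the helper name `naive_bayes_score` is made up for the illustration, and the numbers are copied from the example above.

```python
from fractions import Fraction as F


def naive_bayes_score(likelihoods, prior):
    """Multiply the per-word likelihoods by the class prior."""
    score = prior
    for p in likelihoods:
        score *= p
    return score


# Figures copied from the worked example.
technology = naive_bayes_score([F(8, 100), F(20, 100), F(63, 100)], F(30, 90))
entertainment = naive_bayes_score([F(56, 232), F(25, 122), F(0, 121)], F(60, 90))

print(technology, float(technology))  # 21/6250 (= 126/37500), 0.00336
print(entertainment)                  # 0
```

The Technology class scores higher, so the text is assigned to it. Note that a single zero likelihood (the word never seen in the Entertainment class) wipes out the whole product.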

5. Main logic and code implementation of the article spell checker:

    5.1 Read the file line by line and split each line into words:

    # newLines (the list of words of the current line) is created before this block
    with open(self.filePath, encoding="utf-8", errors="ignore") as f:
        text = f.readlines()
    for lines in text:
        wordsList = lines.split(" ")  # split the line into words on spaces
        for oldWord in wordsList:
            # Words attached to punctuation are skipped for now; to be improved!
            if '"' in oldWord or "'" in oldWord or "." in oldWord or "," in oldWord:  # e.g. Jason's, "Jason", Jason. or Jason,
                pass
            elif not oldWord.strip():  # empty string or pure whitespace
                pass
            else:
                cleanWord = oldWord.replace("\r", "").replace("\n", "").replace("\t", "")
                rightWord = self.check(cleanWord)
                if rightWord != cleanWord:
                    # highlight the misspelled word in the HTML output
                    oldWord += '<span class="highlighted">(' + rightWord + ')</span>'
                    print("Original word:", oldWord, " You may have meant:", rightWord)
            newLines.append(oldWord)
        with open("./files/correct.html", "a", encoding="utf-8") as f:
            f.write("<p>" + " ".join(newLines) + "</p>")  # join the words back with spaces
        newLines = []  # reset newLines after each write
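The listing skips every word that touches punctuation and marks that as something to improve. One possible direction, shown here only as a sketch and not as the author's method, is to tokenize each line with a regular expression so that words and punctuation become separate pieces. The names `TOKEN` and `split_line` are hypothetical.

```python
import re

# A word is a run of letters, optionally with inner apostrophes (Mother's, can’t).
# Everything else (spaces, punctuation, digits) is kept as a separate piece,
# so joining all pieces reproduces the original line exactly.
TOKEN = re.compile(r"(?P<word>[A-Za-z]+(?:['’][A-Za-z]+)*)|(?P<other>[^A-Za-z]+)")


def split_line(line):
    """Return (text, is_word) pieces covering every character of the line."""
    return [(m.group(), m.lastgroup == "word") for m in TOKEN.finditer(line)]


pieces = split_line("people find some specila ways.")
print([text for text, is_word in pieces if is_word])
# ['people', 'find', 'some', 'specila', 'ways']
```

Only the pieces flagged as words would be passed to the checker; the other pieces are written back unchanged, so "ways." no longer hides "ways" from the check.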

    5.2 Spell-check each word that was read:


    def train(self):
        '''
        :return: dictionary of word frequencies {key: value}, key: word, value: countNum
        '''
        with open(self.bigtxtPath, 'r', encoding='utf-8') as f:
            text = f.read()
        allWords = re.findall('[A-Za-z]+', text)  # match every English word
        # counting starts from 1, so a word looked up but never seen gets 1, not 0
        result = collections.defaultdict(lambda: 1)
        for word in allWords:
            result[word] += 1
        return result


    def edit_first(self, word):
        """
        Turn one word into another with a single edit.
        :return: the set of all strings at edit distance 1 from word
        """
        length = len(word)
        return set(
            # deletion: remove one letter at each position in turn
            [word[0:i] + word[i + 1:] for i in range(length)] +
            # transposition: swap each pair of adjacent letters in turn
            [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(length - 1)] +
            # replacement: change one letter at each position in turn
            [word[0:i] + c + word[i + 1:] for i in range(length) for c in self.alphabet] +
            # insertion: insert one letter at each position in turn
            [word[0:i] + c + word[i:] for i in range(length + 1) for c in self.alphabet])

    def edit_second(self, word):
        """
        :return: the set of known words reachable from word with two edits
        """
        words = self.train()  # dictionary holding the frequency of every word
        return set(e2 for e1 in self.edit_first(word) for e2 in self.edit_first(e1) if e2 in words)

    def already_words(self, word):
        """
        :param word: an iterable of candidate words
        :return: the set of candidates that appear in the word-frequency dictionary
        """
        words = self.train()
        return set(w for w in word if w in words)
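To see what the one-edit set contains, here is a standalone sketch of the same four operations. The function name `edits1` and the use of `string.ascii_letters` (the same 52 letters as `self.alphabet`) are choices made for this example. For a word of length n it generates n deletions, n-1 transpositions, 52n replacements and 52(n+1) insertions before duplicates are removed.

```python
import string


def edits1(word, alphabet=string.ascii_letters):
    """Return every string that is one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in alphabet]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


# The three misspellings in the sample essay are each one edit away.
print("special" in edits1("specila"))      # True (transposition)
print("gratitude" in edits1("gratityde"))  # True (replacement)
print("messages" in edits1("messagses"))   # True (deletion)
```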


    def check(self, word):
        words = self.train()
        # Try in order: is the word itself in the dictionary, is a one-edit
        # candidate in it, is a two-edit candidate in it; otherwise keep the word
        neighborhood = (self.already_words([word])
                        or self.already_words(self.edit_first(word))
                        or self.already_words(self.edit_second(word))
                        or [word])
        # take the most probable correct word, i.e. the one with the highest frequency
        return max(neighborhood, key=lambda w: words[w])
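The `or` chain in `check` means that the first non-empty group of known candidates wins, and frequency only breaks ties inside that group. The sketch below isolates that rule with hand-written candidate groups. The name `choose` and the frequencies are hypothetical. It also builds the frequency table once and reuses it, whereas the listing calls `self.train()` again on every lookup, which re-reads big.txt each time. Caching the table this way is a suggestion, not something the article does.

```python
from collections import Counter


def choose(word, tiers, counts):
    """Pick the most frequent known word from the first tier that has one.

    `tiers` lists candidate groups in priority order: the word itself,
    then one-edit candidates, then two-edit candidates.
    """
    for tier in tiers:
        known = [w for w in tier if w in counts]
        if known:
            return max(known, key=lambda w: counts[w])
    return word  # nothing known: leave the word unchanged


# Hypothetical frequencies, built once and reused for every lookup.
counts = Counter({"the": 100, "them": 30, "thee": 2})

print(choose("thee", [["thee"], ["the", "them"]], counts))          # thee
print(choose("thez", [["thez"], ["the", "them", "thee"]], counts))  # the
print(choose("qqq", [["qqq"], [], []], counts))                     # qqq
```

The first call shows the consequence of the priority order: a known word is never replaced, even when a far more frequent neighbour exists.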

    5.3 Displaying the corrected result: render the article as HTML


    # write the page header ("w+" truncates any previous output)
    with open("./files/correct.html", "w+", encoding="utf-8") as f:
        f.write('''
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <title>Misspelled words highlighted in the article</title>
    <style>
    body{ text-align:center}
    .show{ margin:0 auto; width:60%; height:100%; border:2px solid}
    .highlighted{color:red;display:inline-block;}
    /* CSS note: width, border, height, etc. are set to make the result easier to see */
    </style>
    </head>
    <body>
    <div class="show">
    ''')

    # append one paragraph per line of the article
    with open("./files/correct.html", "a", encoding="utf-8") as f:
        f.write("<p>" + " ".join(newLines) + "</p>")  # join the words back with spaces
    newLines = []  # reset newLines after each write


    # closes the try block that reads and checks the article
    finally:
        with open("./files/correct.html", "a", encoding="utf-8") as f:
            f.write('''
    </div>
    </body>
    </html>
    ''')
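The fragments above write words into the page as they are, so a `<` or `&` in the article would end up as markup. A minimal sketch of building one paragraph with escaping is shown below. `render_paragraph` and its (original, suggestion) input format are assumptions for this example, and `html.escape` comes from the Python standard library.

```python
import html


def render_paragraph(words):
    """Build one <p> element from (original, suggestion) pairs.

    `suggestion` is None when the word needs no correction.
    """
    parts = []
    for original, suggestion in words:
        text = html.escape(original)
        if suggestion is not None:
            text += '<span class="highlighted">(' + html.escape(suggestion) + ")</span>"
        parts.append(text)
    return "<p>" + " ".join(parts) + "</p>"


print(render_paragraph([("some", None), ("specila", "special"), ("<ways>", None)]))
# <p>some specila<span class="highlighted">(special)</span> &lt;ways&gt;</p>
```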



    # -*- coding: UTF-8 -*-
    '''
    @Author :Jason
    Version3.0:read .txt files
    '''
    import collections
    import re


    class SpellCheck(object):
        def __init__(self, filePath):
            self.alphabet = list('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
            self.filePath = filePath
            self.bigtxtPath = "./files/big.txt"

        def train(self):
            '''
            :return: dictionary of word frequencies {key: value}, key: word, value: countNum
            '''
            with open(self.bigtxtPath, 'r', encoding='utf-8') as f:
                text = f.read()
            allWords = re.findall('[A-Za-z]+', text)  # match every English word
            # counting starts from 1, so a word looked up but never seen gets 1, not 0
            result = collections.defaultdict(lambda: 1)
            for word in allWords:
                result[word] += 1
            return result

        def edit_first(self, word):
            """
            Turn one word into another with a single edit.
            :return: the set of all strings at edit distance 1 from word
            """
            length = len(word)
            return set(
                # deletion: remove one letter at each position in turn
                [word[0:i] + word[i + 1:] for i in range(length)] +
                # transposition: swap each pair of adjacent letters in turn
                [word[0:i] + word[i + 1] + word[i] + word[i + 2:] for i in range(length - 1)] +
                # replacement: change one letter at each position in turn
                [word[0:i] + c + word[i + 1:] for i in range(length) for c in self.alphabet] +
                # insertion: insert one letter at each position in turn
                [word[0:i] + c + word[i:] for i in range(length + 1) for c in self.alphabet])

        def edit_second(self, word):
            """
            :return: the set of known words reachable from word with two edits
            """
            words = self.train()  # dictionary holding the frequency of every word
            return set(e2 for e1 in self.edit_first(word) for e2 in self.edit_first(e1) if e2 in words)

        def already_words(self, word):
            """
            :param word: an iterable of candidate words
            :return: the set of candidates that appear in the word-frequency dictionary
            """
            words = self.train()
            return set(w for w in word if w in words)

        def check(self, word):
            words = self.train()
            # Try in order: is the word itself in the dictionary, is a one-edit
            # candidate in it, is a two-edit candidate in it; otherwise keep the word
            neighborhood = (self.already_words([word])
                            or self.already_words(self.edit_first(word))
                            or self.already_words(self.edit_second(word))
                            or [word])
            # take the most probable correct word, i.e. the one with the highest frequency
            return max(neighborhood, key=lambda w: words[w])

        def main(self):
            '''
            Main function: checks the words of the document.
            :return: None
            '''
            newLines = []  # holds the words of one corrected line
            with open("./files/correct.html", "w+", encoding="utf-8") as f:
                f.write('''
    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="UTF-8">
    <title>Misspelled words highlighted in the article</title>
    <style>
    body{ text-align:center}
    .show{ margin:0 auto; width:60%; height:100%; border:2px solid}
    .highlighted{color:red;display:inline-block;}
    /* CSS note: width, border, height, etc. are set to make the result easier to see */
    </style>
    </head>
    <body>
    <div class="show">
    ''')
            try:
                with open(self.filePath, encoding="utf-8", errors="ignore") as f:
                    text = f.readlines()
                for lines in text:
                    wordsList = lines.split(" ")  # split the line into words on spaces
                    for oldWord in wordsList:
                        # Words attached to punctuation are skipped for now; to be improved!
                        if '"' in oldWord or "'" in oldWord or "." in oldWord or "," in oldWord:  # e.g. Jason's, "Jason", Jason. or Jason,
                            pass
                        elif not oldWord.strip():  # empty string or pure whitespace
                            pass
                        else:
                            cleanWord = oldWord.replace("\r", "").replace("\n", "").replace("\t", "")
                            rightWord = self.check(cleanWord)
                            if rightWord != cleanWord:
                                # highlight the misspelled word in the HTML output
                                oldWord += '<span class="highlighted">(' + rightWord + ')</span>'
                                print("Original word:", oldWord, " You may have meant:", rightWord)
                        newLines.append(oldWord)
                    with open("./files/correct.html", "a", encoding="utf-8") as f:
                        f.write("<p>" + " ".join(newLines) + "</p>")  # join the words back with spaces
                    newLines = []  # reset newLines after each write
            except Exception as e:
                print("Error while reading the article and checking words:", e)
            finally:
                with open("./files/correct.html", "a", encoding="utf-8") as f:
                    f.write('''
    </div>
    </body>
    </html>
    ''')


    if __name__ == '__main__':
        filePath = "./files/MotherDayArticle.txt"  # path of the document to check
        s = SpellCheck(filePath=filePath)  # create the checker
        s.main()














