
C2W3.LAB.N-grams+Language Model+OOV


Lecture: C2W3.Auto-complete and Language Models



N-grams Corpus preprocessing

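Text preprocessing was covered earlier, but the preprocessing used for language models needs to be distinguished from those earlier steps.
Common preprocessing steps for a language model include: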

  • lowercase the text
  • remove special characters
  • split the text into a list of sentences
  • split each sentence into a list of words
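
The four steps above can be sketched end to end as follows. This is a minimal sketch, not the lab's own implementation: the function name `preprocess` is hypothetical, and the choice to keep only letters, digits, spaces, and the sentence-ending marks `.`, `?`, `!` is an assumption made for illustration.

```python
import re

def preprocess(text):
    """Lowercase, strip special characters, split into sentences, then into words."""
    # Step 1: lowercase the text
    text = text.lower()
    # Step 2: remove special characters (assumption: keep a-z, 0-9, spaces and . ? !)
    text = re.sub(r"[^a-z0-9.?! ]+", "", text)
    # Step 3: split into sentences on sentence-ending punctuation, dropping empty pieces
    sentences = [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]
    # Step 4: split each sentence into a list of words
    return [s.split() for s in sentences]

tokens = preprocess("Learning% makes 'me' happy. I am happy be-cause I am learning! :)")
print(tokens)
```

Note that with this particular character filter the hyphen is dropped, so "be-cause" becomes the single token "because".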

Import the packages:

import nltk               # NLP toolkit
import re                 # Library for Regular expression operations

Lowercase

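Words at the start of a sentence, personal names, and proper nouns begin with a capital letter. When counting words, however, they should be treated the same as the occurrences that appear in the middle of a sentence. The conversion method used here is str.lower.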

# change the corpus to lowercase
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()

# note that word "learning" will now be the same regardless of its position in the sentence
print(corpus)

Result:
learning% makes 'me' happy. i am happy be-cause i am learning! :)
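
To see why this matters for counting, the sketch below compares word counts before and after lowercasing. It is only an illustration: the helper `count_words` is hypothetical, and it uses a crude whitespace split with punctuation stripping rather than the tokenization used later in the lab.

```python
from collections import Counter

corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"

def count_words(text):
    # Crude tokenization (assumption): split on whitespace, strip surrounding punctuation
    words = [w.strip("%!.'") for w in text.split()]
    return Counter(words)

before = count_words(corpus)
after = count_words(corpus.lower())

# Without lowercasing, "Learning" and "learning" are counted as two different words
print(before["Learning"], before["learning"])
# After lowercasing, both occurrences are merged into one entry
print(after["learning"])
```

Before lowercasing, each form is counted once; after lowercasing, "learning" has a count of 2.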
