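Theory lecture: C2W3. Auto-complete and Language Models
Text preprocessing was covered earlier. The preprocessing used for language models differs from those earlier steps, so the two should be kept distinct.
Some common preprocessing steps for language models include:
Import the packages: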
import nltk # NLP toolkit
import re # Library for Regular expression operations
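Words at the start of a sentence, personal names, and proper nouns begin with a capital letter. When counting words, however, they should be treated the same as words that appear in the middle of a sentence. The conversion function used for this is str.lower.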
# change the corpus to lowercase
corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"
corpus = corpus.lower()
# note that the word "learning" is now the same regardless of its position in the sentence
print(corpus)
Output:
learning% makes 'me' happy. i am happy be-cause i am learning! :)
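To make the effect of lowercasing on word counts concrete, the following minimal sketch counts tokens before and after the conversion. The regular-expression tokenizer (runs of ASCII letters) is an assumption made only for illustration and is not necessarily the tokenization the course uses; the names raw_counts and lower_counts are likewise hypothetical.

```python
import re
from collections import Counter

corpus = "Learning% makes 'me' happy. I am happy be-cause I am learning! :)"

# Assumption: a crude tokenizer that keeps only runs of ASCII letters.
raw_tokens = re.findall(r"[A-Za-z]+", corpus)
lower_tokens = re.findall(r"[A-Za-z]+", corpus.lower())

raw_counts = Counter(raw_tokens)
lower_counts = Counter(lower_tokens)

# Without lowercasing, "Learning" and "learning" are counted separately.
print(raw_counts["Learning"], raw_counts["learning"])
# After lowercasing, both occurrences fall under the single key "learning".
print(lower_counts["learning"])
```

Without the conversion, the sentence-initial "Learning" and the sentence-final "learning" are two different keys, which would split the counts an n-gram model relies on.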