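This post is mainly about processing English text; notes on Chinese NLP processing may follow later if there is a chance. The main content is as follows.
Preprocessing underlies many NLP tasks, and good preprocessing has a significant effect on downstream results. Start with tokenization. Chinese word segmentation is a long-standing hard problem. English is much simpler, but it still has pitfalls: contractions such as what's and can't are hard for a basic tokenizer to split correctly, and when a contraction carries negation, mishandling it can seriously hurt the later semantic understanding of the whole sentence. The idea here is to use ordinary regular-expression substitutions to replace the contractions that are hard to tokenize. The code below shows this kind of preprocessing; hopefully it gives you some ideas.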
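To make the problem concrete, here is a minimal sketch. The `naive_tokenize` helper is a hypothetical baseline, not a tokenizer from any particular library; it only keeps runs of letters and digits. It shows how the negation in "can't" disappears unless the contraction is expanded first.

```python
import re


def naive_tokenize(text):
    # Hypothetical baseline tokenizer: keep runs of ASCII letters/digits,
    # drop everything else (including apostrophes).
    return re.findall(r"[A-Za-z0-9]+", text)


sentence = "I can't see what's wrong"

# Without expansion, "can't" falls apart into "can" and "t",
# so the negation is no longer visible as a token.
tokens_raw = naive_tokenize(sentence)

# Expanding the contractions first keeps "not" as its own token.
expanded = re.sub(r"can't", "can not", sentence)
expanded = re.sub(r"what's", "what is", expanded)
tokens_expanded = naive_tokenize(expanded)

print(tokens_raw)
print(tokens_expanded)
```

A real tokenizer may handle some of these cases better than this baseline; the point is only that expanding contractions up front removes the dependency on that behavior.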
Replacing contractions and symbols
import re


def clean_text(text):
    """
    Clean text
    :param text: the string of text
    :return: text string after cleaning
    """
    # units and currency
    text = re.sub(r"(\d+)kgs ", lambda m: m.group(1) + ' kg ', text)  # e.g. 4kgs => 4 kg
    text = re.sub(r"(\d+)kg ", lambda m: m.group(1) + ' kg ', text)  # e.g. 4kg => 4 kg
    text = re.sub(r"(\d+)k ", lambda m: m.group(1) + '000 ', text)  # e.g. 4k => 4000
    text = re.sub(r"\$(\d+)", lambda m: m.group(1) + ' dollar ', text)  # e.g. $5 => 5 dollar
    text = re.sub(r"(\d+)\$", lambda m: m.group(1) + ' dollar ', text)  # e.g. 5$ => 5 dollar
    # contractions and abbreviations
    # (specific forms first; the generic "n't" and "'s" rules come later)
    text = re.sub(r"can\'t", "can not", text)
    text = re.sub(r"cannot", "can not ", text)
    text = re.sub(r"what\'s", "what is", text)
    text = re.sub(r"What\'s", "what is", text)
    text = re.sub(r"\'ve ", " have ", text)
    text = re.sub(r"n\'t", " not ", text)
    text = re.sub(r"i\'m", "i am ", text)
    text = re.sub(r"I\'m", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    # names containing symbols that a tokenizer would otherwise strip
    text = re.sub(r"c\+\+", "cplusplus", text)
    text = re.sub(r"c \+\+", "cplusplus", text)
    text = re.sub(r"c \+ \+", "cplusplus", text)
    text = re.sub(r"c#", "csharp", text)
    text = re.sub(r"f#", "fsharp", text)
    text = re.sub(r"g#", "gsharp", text)
    text = re.sub(r" e mail ", " email ", text)
    text = re.sub(r" e \- mail ", " email ", text)
    text = re.sub(r" e\-mail ", " email ", text)
    text = re.sub(r",000", '000', text)  # e.g. 4,000 => 4000
    text = re.sub(r"\'s", " ", text)  # drop any remaining possessive or "is" suffix
    # spelling correction
    text = re.sub(r"ph\.d", "phd", text)
    text = re.sub(r"PhD", "phd", text)
    text = re.sub(r"pokemons", "pokemon", text)
    text = re.sub(r"pokémon", "pokemon", text)
    text = re.sub(r"pokemon go ", "pokemon-go ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" 9 11 ", " 911 ", text)
    text = re.sub(r" j k ", " jk ", text)
    return text
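The same substitution idea can also be written in a table-driven form, which makes the order of the rules easier to see and change. The sketch below is an illustration, not the listing above: the names `RULES`, `COMPILED`, and `apply_rules` are hypothetical, only a handful of rules are included, and it assumes case-insensitive matching and `\b` word boundaries instead of trailing spaces, so its behavior differs slightly from the original function.

```python
import re

# Hypothetical rule table: (pattern, replacement) pairs applied in order.
# Specific contractions come before the generic "n't" rule.
RULES = [
    (r"(\d+)kgs?\b", r"\1 kg"),   # 4kgs / 4kg => 4 kg
    (r"(\d+)k\b", r"\g<1>000"),   # 4k => 4000
    (r"\$(\d+)", r"\1 dollar"),   # $5 => 5 dollar
    (r"can't", "can not"),
    (r"what's", "what is"),
    (r"n't", " not"),
    (r"'ve", " have"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"c\+\+", "cplusplus"),
]

# Assumption: ignore case so "What's" and "what's" share one rule.
COMPILED = [(re.compile(p, re.IGNORECASE), r) for p, r in RULES]


def apply_rules(text):
    # Apply every rule in table order, then collapse repeated whitespace.
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return re.sub(r"\s+", " ", text).strip()


print(apply_rules("I don't know c++"))
print(apply_rules("What's the price? I can't pay $5 for 4kgs, they'll want 3k"))
```

Keeping the rules in a list means a new contraction is one added line, and the ordering constraint (specific before generic) is visible in one place.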