
NLTK: Natural Language Processing with Python, Part 3: Normalization

Text normalization mainly covers removing punctuation, unifying letter case, handling numbers, expanding contractions, and similar clean-up operations.

1. Removing punctuation

```python
import re
import string

from nltk import word_tokenize

text = """
I Love there things in this world.
Sun, Moon and You.
Sun for morning, Moon for
night, and You forever.
"""
# Tokenize
words = word_tokenize(text)
# ['I', 'Love', 'there', 'things', 'in', 'this', 'world', '.', 'Sun', ',', 'Moon', 'and', 'You', '.', 'Sun', 'for', 'morning', ',', 'Moon', 'for', 'night', ',', 'and', 'You', 'forever', '.']

# re.escape(s) returns s with every non-alphanumeric character backslash-escaped
# string.punctuation holds all (English) punctuation characters
regex_punctuation = re.compile('[%s]' % re.escape(string.punctuation))

# Strip every punctuation character from each token, then drop tokens that
# become empty. Drawback: decimal points in numbers and separators inside
# names are removed as well.
new_words = [w for w in (regex_punctuation.sub("", word) for word in words) if w != ""]
print(new_words)
# ['I', 'Love', 'there', 'things', 'in', 'this', 'world', 'Sun', 'Moon', 'and', 'You', 'Sun', 'for', 'morning', 'Moon', 'for', 'night', 'and', 'You', 'forever']
```
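The drawback noted in the comment above can be softened by stripping punctuation only from the edges of each token, so internal characters such as the decimal point in "3.14" survive. The helper below is a sketch, not part of NLTK; `strip_edge_punctuation` is a hypothetical name.

```python
import re
import string

def strip_edge_punctuation(tokens):
    """Strip punctuation from token edges only, keeping internal characters."""
    punct = re.escape(string.punctuation)
    # Match runs of punctuation anchored at the start or end of the token
    edge = re.compile(r'^[%s]+|[%s]+$' % (punct, punct))
    return [t for t in (edge.sub("", tok) for tok in tokens) if t]

print(strip_edge_punctuation(["Pi", "is", "3.14", ",", "roughly", "."]))
# ['Pi', 'is', '3.14', 'roughly']
```

Pure punctuation tokens such as "," still reduce to the empty string and are filtered out, while "3.14" passes through untouched.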

2. Unifying case

```python
text = "I Love there things in this world. "
text.lower()
text.upper()
```
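Beyond `lower()` and `upper()`, Python also offers `str.casefold()`, which is the safer choice when the goal is case-insensitive comparison rather than display, since it handles characters with no simple lowercase mapping. A brief illustration:

```python
text = "I Love there things in this world."

# lower()/upper() for display-style normalization
print(text.lower())  # i love there things in this world.
print(text.upper())  # I LOVE THERE THINGS IN THIS WORLD.

# casefold() is more aggressive: German ß folds to "ss", so the
# comparison below succeeds where lower() would fail
print("Straße".casefold() == "strasse".casefold())  # True
print("Straße".lower() == "strasse".lower())        # False
```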

3. Handling stopwords

Stopword removal filters out words that occur very frequently but carry little actual meaning.

```python
from nltk.corpus import stopwords

# stopwords is an instance of WordListCorpusReader; its words() method
# returns the stopword list. Pass a language name (the fileids argument)
# to get that language's stopwords; with no argument it returns the
# stopwords of every language. The reader's fileids() method lists which
# languages are available in the NLTK data.
# Use the English stopword list here:
stop_words = set(stopwords.words("english"))
words = ['I', 'Love', 'there', 'things', 'in', 'this', 'world', 'Sun', 'Moon', 'and', 'You',
         'Sun', 'for', 'morning', 'Moon', 'for', 'night', 'and', 'You', 'forever']
# Filter out the words that appear in the stopword list
new_words = [word for word in words if word not in stop_words]
```
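One subtlety: the membership test above is case-sensitive, and the NLTK list stores lowercase entries, so a capitalized 'I' would survive the filter. Lowercasing each token before the lookup fixes this. The sketch below uses a small hand-picked set as a stand-in for `stopwords.words("english")`, so it runs without NLTK data:

```python
# Assumed subset of English stopwords, for illustration only
stop_words = {"i", "in", "this", "and", "for"}

words = ['I', 'Love', 'there', 'things', 'in', 'this', 'world', 'and', 'You']

# Compare in lowercase so capitalized stopwords like 'I' are also removed
new_words = [w for w in words if w.lower() not in stop_words]
print(new_words)
# ['Love', 'there', 'things', 'world', 'You']
```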

4. Replacing and correcting tokens

This step expands contractions, e.g. replacing isn't with is not. It is normally done before tokenization, to avoid problems when the tokenizer splits a contraction.

```python
import re

replace_patterns = [
    (r"can't", "cannot"),
    (r"won't", "will not"),
    (r"i'm", "i am"),
    (r"isn't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
]

class RegexpReplacer(object):
    def __init__(self, replace_patterns=replace_patterns):
        self.patterns = [(re.compile(regex), repl) for regex, repl in replace_patterns]

    def replace(self, text):
        for pattern, repl in self.patterns:
            text, count = re.subn(pattern=pattern, repl=repl, string=text)
        return text

replacer = RegexpReplacer()
text = "The hard part isn't making the decision. It's living with it."
print(replacer.replace(text))
# The hard part is not making the decision. It is living with it.
```
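The patterns above are lowercase-only, so a sentence-initial "Isn't" slips through unexpanded. One option, shown as a sketch rather than a change to the class above, is to compile with `re.IGNORECASE`; the trade-off is that the expansion may not preserve the original capitalization pattern, which is usually acceptable in a pipeline that lowercases afterwards.

```python
import re

# A single case-insensitive contraction pattern, for illustration
patterns = [(re.compile(r"(\w+)n't", re.IGNORECASE), r"\g<1> not")]

def expand(text):
    # Apply each pattern over the whole text
    for pat, repl in patterns:
        text = pat.sub(repl, text)
    return text

print(expand("Isn't it? He isn't sure."))
# Is not it? He is not sure.
```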

5. Removing repeated characters

A slip of the keyboard can turn like into likeeee, which should be normalized back to like. But happy is a legitimate word whose doubled letter must be left alone, so we use the WordNet corpus to check whether a word is valid before squeezing it.

```python
import re

from nltk.corpus import wordnet

class RepeatReplacer(object):
    def __init__(self):
        # A match means the word contains a doubled character
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        # The replacement drops one of the repeated characters
        self.repl = r'\1\2\3'

    def replace(self, word):
        # If WordNet has a synset for the word, it is a real word: keep it
        if wordnet.synsets(word):
            return word
        # If the substitution changed the word (a repeat was removed),
        # recurse to remove any further repeats; otherwise return as-is
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return word

replacer = RepeatReplacer()
print(replacer.replace("likkeee"))
# like
print(replacer.replace("happpyyy"))
# happy
```
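The same squeeze-and-check logic can be exercised without the WordNet data by swapping the synset lookup for a plain Python set acting as the lexicon. This is a self-contained variant for illustration; the `LEXICON` set below is a stand-in for `wordnet.synsets()`:

```python
import re

# Tiny stand-in lexicon; in the class above this role is played by WordNet
LEXICON = {"like", "happy", "moon"}

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def squeeze(word, lexicon=LEXICON):
    if word in lexicon:           # known word: stop squeezing
        return word
    repl_word = repeat_regexp.sub(r'\1\2\3', word)
    if repl_word != word:         # removed a doubled character: recurse
        return squeeze(repl_word, lexicon)
    return word                   # nothing left to squeeze

print(squeeze("likkeee"))
# like
print(squeeze("happpyyy"))
# happy
```

Because the recursion stops as soon as the word appears in the lexicon, legitimate doubled letters (as in happy) survive, while spurious repeats are peeled off one at a time.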