With the rise of techniques such as BERT, preprocessing steps like Chinese word segmentation, stopword filtering, lemmatization, stemming, and punctuation handling have become much less important in text-based competitions. From another angle, though, these preprocessing methods reduce noise in the input and can make a neural network more robust. The following is therefore kept here as reference material; whether to use it in practice is up to your own judgment.
### Lemmatization

```python
from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Map a Penn Treebank POS tag to the corresponding WordNet POS constant
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)   # tokenize
tagged_sent = pos_tag(tokens)      # POS-tag each token

wnl = WordNetLemmatizer()
lemmas_sent = []
for tag in tagged_sent:
    # Fall back to NOUN when the tag has no WordNet equivalent
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos))  # lemmatize

print(lemmas_sent)
```
The output is:

```
['football', 'be', 'a', 'family', 'of', 'team', 'sport', 'that', 'involve', ',', 'to', 'vary', 'degree', ',', 'kick', 'a', 'ball', 'to', 'score', 'a', 'goal', '.']
```