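NLTK (www.nltk.org) is the package you will most often encounter when working with corpora, classifying text, or analyzing linguistic structure. It provides comprehensive, easy-to-use interfaces to a large collection of public datasets and models, covering tokenization, part-of-speech tagging (POS tagging), named entity recognition (NER), syntactic parsing, and other NLP tasks.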
NLTK modules and their functions
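Text is made up of paragraphs, paragraphs are made up of sentences, and sentences are made up of words. Tokenization is the first step of text analysis: it breaks a passage into smaller units such as words or sentences. Each unit is called a token; a token is a word when you split a sentence, and a sentence when you split a paragraph. NLTK supports both sentence tokenization and word tokenization.
Sentence tokenization splits a paragraph into sentences: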
- from nltk.tokenize import sent_tokenize
-
- text="""Hello Mr. Smith, how are you doing today? The weather is great, and
- city is awesome.The sky is pinkish-blue. You shouldn't eat cardboard"""
-
- tokenized_text=sent_tokenize(text)
-
- print(tokenized_text)
-
-
- '''
- Output:
- ['Hello Mr. Smith, how are you doing today?',
- 'The weather is great, and city is awesome.The sky is pinkish-blue.',
- "You shouldn't eat cardboard"]
- '''
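To see why a trained sentence tokenizer is worth using, the sketch below splits on sentence-ending punctuation with a plain regular expression. The helper name `naive_sent_split` is made up for this illustration and is not part of NLTK; it assumes a sentence ends at `.`, `!` or `?` followed by whitespace.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (a deliberately naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Hello Mr. Smith, how are you doing today? The weather is great."
sentences = naive_sent_split(text)
print(sentences)
# ['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great.']
```

The naive rule treats the period in "Mr." as a sentence boundary, a mistake that `sent_tokenize` does not make in the output shown above.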
Word tokenization splits a sentence into words:
- import nltk
-
- sent = "I am almost dead this time"
-
- token = nltk.word_tokenize(sent)
-
- print(token)  # ['I', 'am', 'almost', 'dead', 'this', 'time']
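Word tokenization is more than splitting on whitespace. As a rough comparison, the sketch below contrasts `str.split` with `TreebankWordTokenizer`, a rule-based tokenizer in `nltk.tokenize` that needs no downloaded model. Using it here as a stand-in for `word_tokenize` is an assumption made for illustration only.

```python
from nltk.tokenize import TreebankWordTokenizer

sent = "You shouldn't eat cardboard."

# Plain whitespace split: punctuation stays glued to the words.
whitespace_tokens = sent.split()

# Rule-based tokenizer: separates the final period and splits the contraction.
treebank_tokens = TreebankWordTokenizer().tokenize(sent)

print(whitespace_tokens)
print(treebank_tokens)
```

With the whitespace split, `cardboard.` keeps its period and `shouldn't` stays in one piece, while the rule-based tokenizer emits the period and the `n't` clitic as separate tokens.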
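After tokenization, the tokens need further processing: removing punctuation, removing stop words, and normalizing the vocabulary.
To strip punctuation, apply the code below to each token. string.punctuation contains all ASCII punctuation characters; every such character found in a token is replaced with a space.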
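- # Method 1: replace every punctuation character with a space
- import string
-
- s = 'abc.'
- s = s.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))  # 'abc ' (the period becomes a space)
-
-
- # Method 2: drop tokens that are themselves punctuation marks
- english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']
- text_list = ['Hello', ',', 'world', '!']  # example token list, for illustration
- text_list = [word for word in text_list if word not in english_punctuations]  # ['Hello', 'world']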
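A variation on the first method is to delete punctuation instead of replacing it with spaces, then discard tokens that end up empty. The following is a minimal sketch using only the standard library; `strip_punctuation` is a hypothetical helper name, not an NLTK function.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(tokens):
    """Remove punctuation from each token and drop tokens that become empty."""
    cleaned = (tok.translate(_DELETE_PUNCT) for tok in tokens)
    return [tok for tok in cleaned if tok]

tokens = ["Hello", "Mr.", "Smith", ",", "how", "are", "you", "?"]
cleaned_tokens = strip_punctuation(tokens)
print(cleaned_tokens)
# ['Hello', 'Mr', 'Smith', 'how', 'are', 'you']
```

Note that this also removes apostrophes and hyphens inside words, so it may be too aggressive for text with contractions or hyphenated terms.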
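Stop words are the noise words in a text: they carry little meaning on their own. Common English stop words include is, am, are, this, a, an, and the. NLTK's corpora include a stop word list, and you need to filter those words out of the token list yourself.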
- import nltk
- nltk.download('stopwords')
- # Downloading package stopwords to C:\Users\Administrator\AppData\Roaming\nltk_data...Unzipping corpora\stopwords.zip.
-
- from nltk.corpus import stopwords
- stop_words = stopwords.words("english")
-
- text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome."""
-
- word_tokens = nltk.tokenize.word_tokenize(text.strip())
- # The NLTK stop word list is lowercase, so the capitalized 'The' is kept
- filtered_word = [w for w in word_tokens if w not in stop_words]
-
-
- '''
- word_tokens:['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?',
- 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.']
- filtered_word:['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.']
- '''
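In the output above, "The" survives because the comparison is case-sensitive. One possible fix is to lowercase each token before the lookup. The sketch below uses a tiny hand-written stop word set so that it runs without downloading anything; that set and the helper name `remove_stop_words` are assumptions for illustration, not the NLTK list or API.

```python
# A tiny hand-written stop word set, used only so the example runs without
# the NLTK corpus; it is NOT the NLTK stop word list.
STOP_WORDS = {"is", "am", "are", "this", "a", "an", "the", "and", "how", "you"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form appears in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

word_tokens = ["The", "weather", "is", "great", ",", "and", "city", "is", "awesome", "."]
filtered = remove_stop_words(word_tokens)
print(filtered)
# ['weather', 'great', ',', 'city', 'awesome', '.']
```

Storing the stop words in a set also makes each membership test constant-time, which helps on long token lists.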
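Lexicon normalization reduces the various derived forms of a word to a common base form. NLTK offers two approaches: lemmatization (WordNet) and stemming (Porter).
Lemmatization uses the word's part of speech, and more generally its context, to map an inflected form to its dictionary form, called the lemma. The result is a real word.
Stemming strips affixes from a word and returns the stem, which may not be a real word.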
- from nltk.stem.wordnet import WordNetLemmatizer # equivalent: from nltk.stem import WordNetLemmatizer
- lem = WordNetLemmatizer() # lemmatization (needs the 'wordnet' corpus)
-
- from nltk.stem.porter import PorterStemmer # equivalent: from nltk.stem import PorterStemmer
- stem = PorterStemmer() # stemming
-
- word = "flying"
- print("Lemmatized Word:",lem.lemmatize(word,"v"))
- print("Stemmed Word:",stem.stem(word))
-
- '''
- Lemmatized Word: fly
- Stemmed Word: fli
- '''
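To illustrate how stemming collapses related forms, the sketch below runs `PorterStemmer` over a small family of words. The word list is chosen for this example; the stemmer itself is rule-based and needs no downloaded data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Several derived forms of the same base word.
words = ["connect", "connected", "connecting", "connection", "connections"]

# Map each word to its stem.
stems = {w: stemmer.stem(w) for w in words}
print(stems)
```

All five forms reduce to the same stem, `connect`, which is what makes stemming useful for grouping word variants when counting or indexing.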
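The main goal of part-of-speech (POS) tagging is to identify the grammatical category of each word, such as noun, verb, or adjective. The tagger looks at relationships within the sentence and assigns each word the corresponding tag.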
- import nltk
-
- sent = "Albert Einstein was born in Ulm, Germany in 1879."
- tokens = nltk.word_tokenize(sent)
-
- tags = nltk.pos_tag(tokens)
- print(tags)
-
- '''
- [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'),
- ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]
- '''
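The tags above are Penn Treebank tags, whereas `lemmatize` takes a WordNet part-of-speech letter such as `"v"`. A common way to connect the two is a small mapping function. The sketch below is one possible version; `penn_to_wordnet` is a hypothetical helper, and defaulting to noun for unrecognized tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# A few (word, tag) pairs taken from the tagger output above.
tags = [("Albert", "NNP"), ("Einstein", "NNP"), ("was", "VBD"), ("born", "VBN"), ("in", "IN")]

wordnet_tags = [(word, penn_to_wordnet(tag)) for word, tag in tags]
print(wordnet_tags)
# [('Albert', 'n'), ('Einstein', 'n'), ('was', 'v'), ('born', 'v'), ('in', 'n')]
```

Each resulting letter can then be passed as the second argument to `lemmatize`, so that verbs such as "was" are looked up as verbs and not as nouns.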
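WordNet is a semantically oriented English dictionary, similar to a traditional dictionary, and it ships as part of the NLTK corpora. To look up the synonym sets (synsets) of a word, call synsets(); its pos parameter restricts the lookup to a given part of speech.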
- import nltk
- nltk.download('wordnet') # Downloading package wordnet to C:\Users\Administrator\AppData\Roaming\nltk_data...Unzipping corpora\wordnet.zip.
-
- from nltk.corpus import wordnet
-
- word = wordnet.synsets('spectacular')
- print(word)
- # [Synset('spectacular.n.01'), Synset('dramatic.s.02'), Synset('spectacular.s.02'), Synset('outstanding.s.02')]
-
- print(word[0].definition())
- print(word[1].definition())
- print(word[2].definition())
- print(word[3].definition())
-
- '''
- a lavishly produced performance
- sensational in appearance or thrilling in effect
- characteristic of spectacles or drama
- having a quality that thrusts itself into attention
- '''