1. Introduction to Part-of-Speech Tagging
import nltk

text1 = nltk.word_tokenize("It is a pleasant day today")
print(nltk.pos_tag(text1))  # a list of (word, tag) tuples using Penn Treebank tags
| Number | Tag | Description |
| --- | --- | --- |
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 4 | EX | Existential there |
| 5 | FW | Foreign word |
| 6 | IN | Preposition or subordinating conjunction |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 27 | VB | Verb, base form |
| 28 | VBD | Verb, past tense |
| 29 | VBG | Verb, gerund or present participle |
| 30 | VBN | Verb, past participle |
| 31 | VBP | Verb, non-3rd person singular present |
| 32 | VBZ | Verb, 3rd person singular present |
| 33 | WDT | Wh-determiner |
| 34 | WP | Wh-pronoun |
| 35 | WP$ | Possessive wh-pronoun |
| 36 | WRB | Wh-adverb |
Building a (token, tag) tuple
import nltk
taggedword=nltk.tag.str2tuple('bear/NN')
print(taggedword)
print(taggedword[0])
print(taggedword[1])

import nltk
sentence = '''The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT region/NN ./. This/DT is/VBZ a/DT pilgrimage/NN ./. People/NNP from/IN all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. '''
print([nltk.tag.str2tuple(t) for t in sentence.split()])

Converting a tuple back into a string
import nltk
taggedtok = ('bear', 'NN')
from nltk.tag.util import tuple2str
print(tuple2str(taggedtok))

Counting how often each tag occurs
import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
print(tag.most_common())

Setting a default tag and removing tags
import nltk
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN')
print(tag.tag(['Beautiful', 'morning']))

import nltk
from nltk.tag import untag
print(untag([('beautiful', 'NN'), ('morning', 'NN')]))
There are two main ways to carry out a tagging task with NLTK:
1. Use a pre-built tagger from NLTK or another library and apply it to your data. This is sufficient for English and for tasks that are not specialized.
2. Create or train a suitable tagger on your own data, which is what you do when handling a very specialized use case.

A typical tagger needs a large amount of training data; its job is to label each word of a sentence with its part of speech. A great deal of effort has already gone into annotating such corpora, so training your own POS tagger from scratch is serious work. Below we look at the performance of a few taggers.
- Sequential taggers
- Start with a tagger that simply labels every token 'NN':
import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))  # about 0.13 — accuracy this low means the tagger is practically useless on its own
- Using the N-gram taggers discussed in earlier chapters

We use the first 90% of the corpus as a training set to learn the tagger's rules, then evaluate on the remaining 10% as a test set:
import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
brown_tagged_sents=brown.tagged_sents(categories='news')
default_tagger=nltk.DefaultTagger('NN')
train_data=brown_tagged_sents[:int(len(brown_tagged_sents)*0.9)]
test_data=brown_tagged_sents[int(len(brown_tagged_sents)*0.9):]
unigram_tagger=UnigramTagger(train_data,backoff=default_tagger)
print(unigram_tagger.evaluate(test_data))

bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)
print(bigram_tagger.evaluate(test_data))

trigram_tagger = TrigramTagger(train_data, backoff=bigram_tagger)
print(trigram_tagger.evaluate(test_data))

To make the training and testing process clearer, here is a smaller example:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
unitag = UnigramTagger(model={'Vinken': 'NN'})  # a hand-built model that knows only a single word, standing in for training
print(unitag.tag(treebank.sents()[0]))

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
training = treebank.tagged_sents()[:7000]
unitagger = UnigramTagger(training)  # train on the tagged corpus
testing = treebank.tagged_sents()[2000:]  # note: this slice overlaps the training data, so the score is optimistic
print(unitagger.evaluate(testing))

On the role of the backoff mechanism
This is a key feature of sequential tagging: when a word's tag cannot be determined from the limited training data, the tagger hands the word to the next tagger (the backoff) in the chain.
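The hand-off is easy to see with a tiny hand-built model (the single-entry dictionary here is purely illustrative): the unigram tagger knows exactly one word, and every other word falls through to the default tagger.

```python
from nltk.tag import DefaultTagger, UnigramTagger

fallback = DefaultTagger('NN')  # last resort: call everything a noun
tagger = UnigramTagger(model={'bear': 'VB'}, backoff=fallback)  # knows only 'bear'

# 'bear' is found in the model; 'hug' is unknown and falls back to 'NN'
print(tagger.tag(['bear', 'hug']))  # → [('bear', 'VB'), ('hug', 'NN')]
```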
There are, of course, many more taggers; reading their source code is worthwhile.
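One of these is `RegexpTagger`, which assigns tags from an ordered list of regular-expression patterns, tried top to bottom. The pattern list below is a made-up sketch, not a serious grammar:

```python
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),   # simple past
    (r'.*s$', 'NNS'),    # plural nouns
    (r'^\d+$', 'CD'),    # cardinal numbers
    (r'.*', 'NN'),       # default: noun
]
regexp_tagger = RegexpTagger(patterns)
print(regexp_tagger.tag(['3', 'walking', 'sheds']))
# → [('3', 'CD'), ('walking', 'VBG'), ('sheds', 'NNS')]
```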
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
prefixtag = AffixTagger(training, affix_length=4)  # tag a word from its first four characters
print(prefixtag.evaluate(testing))

import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
suffixtag = AffixTagger(training, affix_length=-3)  # tag a word from its last three characters
print(suffixtag.evaluate(testing))
Taggers built on machine-learning models are a topic for later study.