This chapter introduces some basic NLP techniques, including sequence tagging, N-gram models, backoff, and evaluation.
Classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging (POS tagging), or simply tagging.
Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset.
5.1 Using a Part-of-Speech Tagger
A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
>>> import nltk
>>> text = nltk.word_tokenize('and now for something completely different')
>>> nltk.pos_tag(text)
[('and', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
Here CC is a coordinating conjunction, RB an adverb, IN a preposition, NN a noun, and JJ an adjective.
NLTK provides documentation for each tag, which can be looked up by the tag itself. For example:
>>> nltk.help.upenn_tagset('RB')
RB: adverb
occasionally unabatingly maddeningly adventurously professedly
stirringly prominently technologically magisterially predominately
swiftly fiscally pitilessly ...
This confirms that RB stands for adverb. In the same way, NN turns out to mean noun:
>>> nltk.help.upenn_tagset('NN')
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
Tags can also be looked up with regular-expression-style patterns; for example, nltk.help.brown_tagset('NN.*') lists every Brown tag that begins with NN:
>>> nltk.help.brown_tagset('NN.*')
NN: noun, singular, common
failure burden court fire appointment awarding compensation Mayor
interim committee fact effect airport management surveillance jail
doctor intern extern night weekend duty legislation Tax Office ...
NN$: noun, singular, common, genitive
season's world's player's night's chapter's golf's football's
baseball's club's U.'s coach's bride's bridegroom's board's county's
firm's company's superintendent's mob's Navy's ...
NN+BEZ: noun, singular, common + verb 'to be', present tense, 3rd person singular
text.similar(word) finds other words in the text that are used in contexts similar to word: if word1 occurs with the same surrounding words as word, then word1 will appear in the list returned by this function.
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('women')
people men others the time children that one work man af house girls
and two way state years water this
The words returned are, in some sense, used in the same way as 'women'. The similar() function can also be used to examine whether different texts were written by the same author.
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
>>> text2.similar('lady')
man house day moment world person brother subject family wife time
woman year case men week colonel park manner sister
>>> text1.similar('lady')
body whale one ship crew pequod world fish english whales deep boat
seas side man harpooneers voyage ribs boats fire
As these lists show, text1 and text2 differ noticeably in style, which suggests that the two books were not written by the same author.
5.2 Tagged Corpora
Representing tagged tokens
In NLTK, a tagged token is represented as a tuple consisting of the token and its tag. Such a tuple can be created from the standard string representation with str2tuple().
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[1]
'NN'
>>> tagged_token[0]
'fly'
Reading tagged corpora
Whenever a corpus contains tagged text, NLTK's corpus interface provides a tagged_words() method.
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
A simplified part-of-speech tagset
Tag | Meaning | Examples |
---|---|---|
ADJ | adjective | new, good, high, special, big, local |
ADV | adverb | really, already, still, early, now |
CNJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no |
EX | existential | there, there's |
FW | foreign word | dolce, ersatz, esprit, quo, maitre |
MOD | modal verb | will, can, would, may, must, should |
N | noun | year, home, costs, time, education |
NP | proper noun | Alison, Africa, April, Washington |
NUM | number | twenty-four, fourth, 1991, 14:24 |
PRO | pronoun | he, their, her, its, my, I, us |
P | preposition | on, of, at, with, by, into, under |
TO | the word to | to |
UH | interjection | ah, bang, ha, whee, hmpf, oops |
V | verb | is, has, get, do, make, see, run |
VD | past tense | said, took, told, made, asked |
VG | present participle | making, going, playing, working |
VN | past participle | given, taken, begun, sung |
WH | wh determiner | who, which, when, what, where, how |
. | punctuation | . , ; ! |
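The simplified categories in this table correspond closely to the universal tagset used above; in recent NLTK versions the universal tags can also be requested directly from the tagger by passing tagset='universal' to nltk.pos_tag(). A small sketch; the expected tags are indicated only roughly in the comment:

import nltk

# tag a tokenized sentence, mapping the default Penn Treebank tags to universal tags
tokens = nltk.word_tokenize('and now for something completely different')
print(nltk.pos_tag(tokens, tagset='universal'))
# roughly: [('and', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ...]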
Let us look at which parts of speech occur in the Brown corpus:
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news',tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word,tag) in brown_news_tagged)
>>> tag_fd.keys()
dict_keys(['DET', 'NOUN', 'ADJ', 'VERB', 'ADP', '.', 'ADV', 'CONJ', 'PRT', 'PRON', 'NUM', 'X'])
>>> tag_fd
FreqDist({'NOUN': 30654, 'VERB': 14399, 'ADP': 12355, '.': 11928, 'DET': 11389, 'ADJ': 6706, 'ADV': 3349, 'CONJ': 2717, 'PRON': 2535, 'PRT': 2264, ...})
>>> tag_fd.plot()
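Beyond counting tags on their own, a conditional frequency distribution can record which tags each individual word receives, as the NLTK book goes on to do. A brief sketch (the word 'yield' is just an illustrative query):

import nltk
from nltk.corpus import brown

# (word, tag) pairs serve as (condition, sample) pairs for the conditional FreqDist
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
cfd = nltk.ConditionalFreqDist(brown_news_tagged)
print(cfd['yield'].most_common())   # how often 'yield' is tagged as VERB, NOUN, etc.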
The nltk.app.concordance() function can be used to open NLTK's built-in graphical interface and search for the usage of a word, although the only corpus that appears to be searchable here is Brown. Note that the function takes no arguments:
>>> nltk.app.concordance('fly/NN')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: app() takes 0 positional arguments but 1 was given
>>> nltk.app.concordance()
Switching to a different corpus inside the GUI also raises an error.
Regarding the book's sections on tagging nouns, verbs, adjectives and adverbs: because the tagged_words() parameter has changed from simplify_tags=True to tagset='universal', the list of (word, tag) pairs returned no longer quite matches the reference book, and since later work will not necessarily rely on NLTK's pre-tagged texts, that material is not expanded on here.
Mapping words to properties using Python dictionaries
Define an empty dictionary, add four words by hand together with their parts of speech, and then look up values by key.
>>> import nltk
>>> pairs ={}
>>> pairs['genius']='N'
>>> pairs['monstrous']='ADJ'
>>> pairs['have']='V'
>>> pairs['carelessly']='ADV'
>>> pairs
{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV'}
Because a dictionary is a mapping rather than a sequence, its key-value pairs are not kept in any inherent order. To get at the keys, the dictionary can be converted into a list.
>>> list(pairs)
['genius', 'monstrous', 'have', 'carelessly']
>>> sorted(pairs)
['carelessly', 'genius', 'have', 'monstrous']
>>> list(pairs)
['genius', 'monstrous', 'have', 'carelessly']
>>> for word in sorted(pairs):
... print(word+':',pairs[word])
...
carelessly: ADV
genius: N
have: V
monstrous: ADJ
The dictionary methods keys(), values(), and items() can of course also be used to access the keys, values, and key-value pairs as separate list-like views.
>>> pairs.keys()
dict_keys(['genius', 'monstrous', 'have', 'carelessly'])
>>> pairs.values()
dict_values(['N', 'ADJ', 'V', 'ADV'])
>>> pairs.items()
dict_items([('genius', 'N'), ('monstrous', 'ADJ'), ('have', 'V'), ('carelessly', 'ADV')])
>>> for key ,val in sorted(pairs.items()):
... print(key+':',val)
...
carelessly: ADV
genius: N
have: V
monstrous: ADJ
When a single word has several parts of speech, a list can be stored as the value, holding all of its tags.
>>> pairs['sleep']='V'
>>> pairs
{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': 'V'}
>>> pairs['sleep']='N'
>>> pairs
{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': 'N'}
>>> pairs['sleep']=['N','V']
>>> pairs
{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': ['N', 'V']}
Dictionary keys must be of an immutable type, such as a tuple or a string; a list cannot be used as a key.
>>> pairs['good','nice']='ADJ'
>>> pairs
{'genius': 'N', 'monstrous': 'ADJ', 'have': 'V', 'carelessly': 'ADV', 'sleep': ['N', 'V'], ('good', 'nice'): 'ADJ'}
>>> pos = {['ideas','blogs','adventures']:'N'}
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
Sometimes we try to look up a word (key) that does not exist in the dictionary, and the lookup raises an error.
Since Python 2.5 the standard library has provided defaultdict, which lets a missing key be given a value of a preset type and stored in the dictionary automatically.
>>> frequency = nltk.defaultdict(int)
>>> frequency['colorless']=4
>>> frequency['ideas']
0
>>> pos = nltk.defaultdict(lambda:'N')
>>> pos['colorless']='ADJ'
>>> pos['apple']
'N'
>>> pos.items()
dict_items([('colorless', 'ADJ'), ('apple', 'N')])
Default dictionaries are useful in larger language-processing tasks. Many such tasks, tagging included, spend considerable effort correctly handling words that occur only once in a text; results can be better when the vocabulary is fixed and no new words appear. With a default dictionary we can preprocess a text, replacing low-frequency words with a special "out of vocabulary" token, UNK.
>>> alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> vocab = nltk.FreqDist(alice)
>>> v1000 = list(vocab)[:1000]
>>> mapping = nltk.defaultdict(lambda:'UNK')
>>> for v in v1000:
... mapping[v]=v
...
>>> alice2 = [mapping[v] for v in alice]
>>> alice2[:100]
['[', 'Alice', "'", 's', 'Adventures', 'in', 'Wonderland', 'by', 'Lewis', 'Carroll', '1865', ']', 'CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit', '-', 'Hole', 'Alice', 'was', 'beginning', 'to', 'get', 'very', 'tired', 'of', 'sitting', 'by', 'her', 'sister', 'on', 'the', 'bank', ',', 'and', 'of', 'having', 'nothing', 'to', 'do', ':', 'once', 'or', 'twice', 'she', 'had', 'peeped', 'into', 'the', 'book', 'her', 'sister', 'was', 'reading', ',', 'but', 'it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it', ',', "'", 'and', 'what', 'is', 'the', 'use', 'of', 'a', 'book', ",'", 'thought', 'Alice', "'", 'without', 'pictures', 'or', 'conversation', "?'", 'So', 'she', 'was', 'considering', 'in', 'her', 'own', 'mind', '(', 'as', 'well', 'as', 'she', 'could', ',']
>>> len(set(alice2))
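One caveat about the snippet above: in NLTK 3 a FreqDist is a subclass of collections.Counter, and list(vocab) is not guaranteed to be in frequency order, so list(vocab)[:1000] may not actually pick the 1000 most frequent words. To select them explicitly one could write:

import nltk

alice = nltk.corpus.gutenberg.words('carroll-alice.txt')
vocab = nltk.FreqDist(alice)
# most_common(1000) returns (word, count) pairs sorted by decreasing frequency
v1000 = [word for (word, _count) in vocab.most_common(1000)]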
Incrementally updating a dictionary
A dictionary can be used to count occurrences. First initialize an empty defaultdict, then process each part-of-speech tag in the text: a tag that has not been seen before gets a default count of 0, and each time a tag is encountered its count is incremented.
>>> counts = nltk.defaultdict(int)
>>> for (word,tag) in brown.tagged_words(categories='news'):
... counts[tag] += 1
...
>>> counts['N']
0
>>> list(counts)
['AT', 'NP-TL', 'NN-TL', 'JJ-TL', 'VBD', 'NR', 'NN', 'IN', 'NP$', 'JJ', '``', "''", 'CS', 'DTI', 'NNS', '.', 'RBR', ',', 'WDT', 'HVD', 'VBZ', 'CC', 'IN-TL', 'BEDZ', 'VBN', 'NP', 'BEN', 'TO', 'VB', 'RB', 'DT', 'PPS', 'DOD', 'AP', 'BER', 'HV', 'DTS', 'VBG', 'PPO', 'QL', 'JJT', 'ABX', 'NN-HL', 'VBN-HL', 'WRB', 'CD', 'MD', 'BE', 'JJR', 'VBG-TL', 'BEZ', 'NN$-TL', 'HVZ', 'ABN', 'PN', 'PPSS', 'PP$', 'DO', 'NN$', 'NNS-HL', 'WPS', '*', 'EX', 'VB-HL', ':', '(', ')', 'NNS-TL', 'NPS', 'JJS', 'RP', '--', 'BED', 'OD', 'BEG', 'AT-HL', 'VBG-HL', 'AT-TL', 'PPL', 'DOZ', 'NP-HL', 'NR$', 'DOD*', 'BEDZ*', ',-HL', 'CC-TL', 'MD*', 'NNS$', 'PPSS+BER', "'", 'PPSS+BEM', 'CD-TL', 'RBT', '(-HL', ')-HL', 'MD-HL', 'VBZ-HL', 'IN-HL', 'JJ-HL', 'PPLS', 'CD-HL', 'WPO', 'JJS-TL', 'ABL', 'BER-HL', 'PPS+HVZ', 'VBD-HL', 'RP-HL', 'MD*-HL', 'AP-HL', 'CS-HL', 'DT$', 'HVN', 'FW-IN', 'FW-DT', 'VBN-TL', 'NR-TL', 'NNS$-TL', 'FW-NN', 'HVG', 'DTX', 'OD-TL', 'BEM', 'RB-HL', 'PPSS+MD', 'NPS-HL', 'NPS$', 'WP$', 'NN-TL-HL', 'CC-HL', 'PPS+BEZ', 'AP-TL', 'UH-TL', 'BEZ-HL', 'TO-HL', 'DO*', 'VBN-TL-HL', 'NNS-TL-HL', 'DT-HL', 'BE-HL', 'DOZ*', 'QLP', 'JJR-HL', 'PPSS+HVD', 'FW-IN+NN', 'PP$$', 'JJT-HL', 'NP-TL-HL', 'NPS-TL', 'MD+HV', 'NP$-TL', 'OD-HL', 'JJR-TL', 'VBD-TL', 'DT+BEZ', 'EX+BEZ', 'PPSS+HV', ':-HL', 'PPS+MD', 'UH', 'FW-CC', 'FW-NNS', 'BEDZ-HL', 'NN$-HL', '.-HL', 'HVD*', 'BEZ*', 'AP$', 'NP+BEZ', 'FW-AT-TL', 'VB-TL', 'RB-TL', 'MD-TL', 'PN+HVZ', 'FW-JJ-TL', 'FW-NN-TL', 'ABN-HL', 'PPS+BEZ-HL', 'NR-HL', 'HVD-HL', 'RB$', 'FW-AT-HL', 'DO-HL', 'PP$-TL', 'FW-IN-TL', 'WPS+BEZ', '*-HL', 'DTI-HL', 'PN-HL', 'CD$', 'BER*', 'NNS$-HL', 'PN$', 'BER-TL', 'TO-TL', 'FW-JJ', 'BED*', 'RB+BEZ', 'VB+PPO', 'PPSS-HL', 'HVZ*', 'FW-IN+NN-TL', 'FW-IN+AT-TL', 'NN-NC', 'JJ-NC', 'NR$-TL', 'FW-PP$-NC', 'FW-VB', 'FW-VB-NC', 'JJR-NC', 'NPS$-TL', 'QL-TL', 'FW-AT', 'FW-*', 'FW-CD', 'WQL', 'FW-WDT', 'WDT+BEZ', 'N']
>>> len(counts)
219
>>> from operator import itemgetter
>>> sorted(counts.items(),key=itemgetter(1),reverse=True)
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), (',', 5133), ('NNS', 5066), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524), ('NN-TL', 2486), ('VB', 2440), ('VBN', 2269), ('RB', 2166), ('CD', 2020), ('CS', 1509), ('VBG', 1398), ('TO', 1237), ('PPS', 1056), ('PP$', 1051), ('MD', 1031), ('AP', 923), ('NP-TL', 741), ('``', 732), ('BEZ', 730), ('BEDZ', 716), ("''", 702), ('JJ-TL', 689), ('PPSS', 602), ('DT', 589), ('BE', 525), ('VBZ', 519), ('NR', 495), ('RP', 482), ('QL', 468), ('PPO', 412), ('WPS', 395), ('NNS-TL', 344), ('WDT', 343), ('BER', 328), ('WRB', 328), ('OD', 309), ('HVZ', 301), ('--', 300), ('NP$', 279), ('HV', 265), ('HVD', 262), ('*', 256), ('BED', 252), ('NPS', 215), ('BEN', 212), ('NN$', 210), ('DTI', 205), ('NP-HL', 186), ('ABN', 183), ('NN-HL', 171), ('IN-TL', 164), ('EX', 161), (')', 151), ('(', 148), ('JJR', 145), (':', 137), ('DTS', 136), ('JJT', 100), ('CD-TL', 96), ('NNS-HL', 92), ('PN', 89), ('RBR', 88), ('VBN-TL', 87), ('ABX', 73), ('NN$-TL', 69), ('IN-HL', 65), ('DOD', 64), ('DO', 63), ('BEG', 57), (',-HL', 55), ('VBN-HL', 53), ('AT-TL', 50), ('NNS$', 50), ('CD-HL', 50), ('PPS+BEZ-HL', 1), ('HVD-HL', 1), ('RB$', 1), ('FW-AT-HL', 1), ('DO-HL', 1), ('PP$-TL', 1), ('FW-IN-TL', 1), ('*-HL', 1), ('PN-HL', 1), ('PN$', 1), ('BER-TL', 1), ('TO-TL', 1), ('BED*', 1), ('RB+BEZ', 1), ('VB+PPO', 1), ('PPSS-HL', 1), ('HVZ*', 1), ('FW-IN+NN-TL', 1), ('FW-IN+AT-TL', 1), ('JJ-NC', 1), ('NR$-TL', 1), ('FW-PP$-NC', 1), ('FW-VB', 1), ('FW-VB-NC', 1), ('JJR-NC', 1), ('NPS$-TL', 1), ('QL-TL', 1), ('FW-*', 1), ('FW-CD', 1), ('WQL', 1), ('FW-WDT', 1), ('WDT+BEZ', 1), ('N', 0)]
The first argument of sorted() is the items to sort, a list of tuples each consisting of a POS tag and its frequency. The second argument specifies the sort key using itemgetter(). The last argument indicates that the items should be returned in reverse order, i.e. in decreasing frequency.
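For comparison, the same frequency-ordered listing can be produced without itemgetter(), either by letting FreqDist do the counting and sorting or by using a lambda as the sort key. A small sketch:

import nltk
from nltk.corpus import brown

# FreqDist counts the tags; most_common() already sorts by decreasing frequency
tag_counts = nltk.FreqDist(tag for (word, tag) in brown.tagged_words(categories='news'))
print(tag_counts.most_common(10))

# itemgetter(1) above is equivalent to a lambda that selects the count
print(sorted(tag_counts.items(), key=lambda item: item[1], reverse=True)[:10])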
>>> last_letters = nltk.defaultdict(list)
>>> words = nltk.corpus.words.words('en')
>>> for word in words:
... key = word[-2:]
... last_letters[key].append(word)
...
>>> last_letters['lly']
[]
>>> len(last_letters['ly'])
11523
>>> anagrams = nltk.defaultdict(list)
>>> for word in words:
... key = ''.join(sorted(word))
... anagrams[key].append(word)
...
>>> anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
Accumulating words in a dictionary keyed by some feature of the word, and then looking them up by that feature, is such a common task that NLTK provides a more convenient way to build such an index:
>>> anagrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
>>> aragrams = nltk.Index((''.join(sorted(w)),w) for w in words)
>>> aragrams['aeilnrt']
['entrail', 'latrine', 'ratline', 'reliant', 'retinal', 'trenail']
As these examples show, nltk.Index is a defaultdict(list) with extra support for initialization, while nltk.FreqDist is essentially a defaultdict(int) with extra support for initialization (plus sorting and plotting methods).
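To make that observation concrete, here is a rough sketch (not NLTK's actual implementation) of how an Index-like class can be built on top of defaultdict(list), adding initialization from (key, value) pairs:

from collections import defaultdict

class SimpleIndex(defaultdict):
    """A defaultdict(list) that can be initialized from (key, value) pairs,
    roughly mimicking the behaviour of nltk.Index (illustrative only)."""
    def __init__(self, pairs=()):
        super().__init__(list)
        for key, value in pairs:
            self[key].append(value)

words = ['entrail', 'latrine', 'reliant', 'retinal']
idx = SimpleIndex((''.join(sorted(w)), w) for w in words)
print(idx['aeilnrt'])   # ['entrail', 'latrine', 'reliant', 'retinal']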
Dictionaries support efficient lookup of the value for any key, but sometimes we need to go the other way, from a value back to its key. If such reverse lookups have to be performed repeatedly, it pays to build a dictionary that maps values to keys. As long as no two keys share the same value, this is just a matter of taking all of the key-value pairs and creating a new dictionary of value-key pairs.
>>> pos = {'colorless':'ADJ','ideas':'N','sleep':'V','furiously':'ADV'}
>>> pos2 = dict((value,key) for (key,value) in pos.items())
>>> pos2['N']
'ideas'
>>> pos.update({'cats':'N','search':'V','peaceful':'ADV','old':'ADJ'})
>>> pos2 = nltk.defaultdict(list)
>>> for key,value in pos.items():
... pos2[value].append(key)
...
>>> pos2['ADV']
['furiously', 'peaceful']
The update() call added several words to pos and created a situation in which multiple keys share the same value, so the simple inversion used earlier no longer works (later pairs would overwrite earlier ones). Instead, append() is used to accumulate, for each part of speech, all of the words that carry it.
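Incidentally, the inverted mapping built above with defaultdict(list) and append() can also be obtained with nltk.Index, just as in the anagram example. A short sketch:

import nltk

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV',
       'cats': 'N', 'search': 'V', 'peaceful': 'ADV', 'old': 'ADJ'}
# swap (key, value) into (value, key) pairs; Index groups the keys under each value
pos2 = nltk.Index((value, key) for (key, value) in pos.items())
print(pos2['ADV'])   # ['furiously', 'peaceful']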
Summary of the Python dictionary methods used here:
d1.update(d2): add all items from d2 to d1
defaultdict(int): a dictionary whose default value is zero