In short: NLTK is not perfect. In fact, no model is perfect.

Note: As of NLTK version 3.1, the default pos_tag function is no longer the old MaxEnt English pickle. It is now the perceptron tagger from @Honnibal's implementation; see nltk.tag.pos_tag:

>>> import inspect
>>> from nltk import pos_tag
>>> print(inspect.getsource(pos_tag))
def pos_tag(tokens, tagset=None):
    tagger = PerceptronTagger()
    return _pos_tag(tokens, tagset, tagger)

It is better, but still not perfect:

>>> from nltk import pos_tag
>>> pos_tag("The quick brown fox jumps over the lazy dog".split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

If you just want a TL;DR solution, see https://github.com/alvations/nltk_cli

In long: try another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g. HunPos, Stanford POS, or Senna.

Using the default MaxEnt POS tagger from NLTK (nltk.pos_tag before version 3.1):

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over the lazy dog"
>>> pos_tag(word_tokenize(text))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]

Using the Stanford POS tagger:

$ cd ~
$ wget http://nlp.stanford.edu/software/stanford-postagger-2015-04-20.zip
$ unzip stanford-postagger-2015-04-20.zip
$ mv stanford-postagger-2015-04-20 stanford-postagger
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> # Note: newer NLTK releases name this class StanfordPOSTagger.
>>> from nltk.tag.stanford import POSTagger
>>> _path_to_model = home + '/stanford-postagger/models/english-bidirectional-distsim.tagger'
>>> _path_to_jar = home + '/stanford-postagger/stanford-postagger.jar'
>>> st = POSTagger(path_to_model=_path_to_model, path_to_jar=_path_to_jar)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[(u'The', u'DT'), (u'quick', u'JJ'), (u'brown', u'JJ'), (u'fox', u'NN'), (u'jumps', u'VBZ'), (u'over', u'IN'), (u'the', u'DT'), (u'lazy', u'JJ'), (u'dog', u'NN')]

Using HunPOS (note: the default encoding is ISO-8859-1, not UTF-8):

$ cd ~
$ wget https://hunpos.googlecode.com/files/hunpos-1.0-linux.tgz
$ tar zxvf hunpos-1.0-linux.tgz
$ wget https://hunpos.googlecode.com/files/en_wsj.model.gz
$ gzip -d en_wsj.model.gz
$ mv en_wsj.model hunpos-1.0-linux/
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.hunpos import HunposTagger
>>> _path_to_bin = home + '/hunpos-1.0-linux/hunpos-tag'
>>> _path_to_model = home + '/hunpos-1.0-linux/en_wsj.model'
>>> ht = HunposTagger(path_to_model=_path_to_model, path_to_bin=_path_to_bin)
>>> text = "The quick brown fox jumps over the lazy dog"
>>> ht.tag(text.split())
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

Using Senna (make sure you have the latest version of NLTK; some changes were made to the API):

$ cd ~
$ wget http://ronan.collobert.com/senna/senna-v3.0.tgz
$ tar zxvf senna-v3.0.tgz
$ python
>>> from os.path import expanduser
>>> home = expanduser("~")
>>> from nltk.tag.senna import SennaTagger
>>> st = SennaTagger(home+'/senna')
>>> text = "The quick brown fox jumps over the lazy dog"
>>> st.tag(text.split())
[('The', u'DT'), ('quick', u'JJ'), ('brown', u'JJ'), ('fox', u'NN'), ('jumps', u'VBZ'), ('over', u'IN'), ('the', u'DT'), ('lazy', u'JJ'), ('dog', u'NN')]

Or try building a better POS tagger:

Ngram tagger: http://streamhacker.com/2008/11/03/
Affix tagger: http://streamhacker.com/2008/11/10/
Build your own Brill tagger (read the code, it is a pretty fun tagger: http://www.nltk.org/_modules/nltk/tag/brill.html), see http://streamhacker.com/2008/12/03/
Perceptron tagger: https://honnibal.wordpress.com/2013/09/11
LDA tagger: http://scm.io/blog/hack/2015/02/lda-intentions/

Complaints about pos_tag accuracy on Stack Overflow include:

POS tagging - NLTK thinks noun is adjective
Python NLTK POS tagger not behaving as expected
How to obtain better results using NLTK pos tag
pos_tag in NLTK does not tag sentences correctly

Issues about NLTK HunPos include:

How do I tag text files with hunpos in nltk?
Does anyone know how to configure the hunpos wrapper class on nltk?

Issues with NLTK and the Stanford POS tagger include:

Trouble importing Stanford POS tagger into NLTK
Java command fails in NLTK Stanford POS tagger
Error using Stanford POS tagger in NLTK Python
How to improve speed with Stanford NLP tagger and NLTK
NLTK Stanford POS tagger error: Java command failed
Instantiating and using StanfordTagger within NLTK
Running Stanford POS tagger in NLTK leads to "not a valid Win32 application" on Windows
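
To see where the taggers discussed in this answer actually disagree on the example sentence, their outputs can be compared token by token. The following is only a minimal sketch: the tag sequences are copied by hand from the outputs printed in this answer (so it needs neither NLTK nor the external taggers), and `disagreements` is a hypothetical helper name, not part of NLTK.

```python
# Tokens of the example sentence used throughout this answer.
SENTENCE = "The quick brown fox jumps over the lazy dog".split()

# Tag sequences copied from the outputs printed in this answer.
OUTPUTS = {
    "maxent":     ["DT", "NN", "NN", "NN", "NNS", "IN", "DT", "NN", "NN"],
    "perceptron": ["DT", "JJ", "NN", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "stanford":   ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
    "hunpos":     ["DT", "JJ", "JJ", "NN", "NNS", "IN", "DT", "JJ", "NN"],
    "senna":      ["DT", "JJ", "JJ", "NN", "VBZ", "IN", "DT", "JJ", "NN"],
}


def disagreements(tokens, outputs):
    """Return (token, {tagger: tag}) for each token the taggers do not all agree on."""
    rows = []
    for i, token in enumerate(tokens):
        tags = {name: seq[i] for name, seq in outputs.items()}
        if len(set(tags.values())) > 1:
            rows.append((token, tags))
    return rows


if __name__ == "__main__":
    for token, tags in disagreements(SENTENCE, OUTPUTS):
        print(token, tags)
```

On this sentence the taggers differ only on "quick", "brown", "jumps" and "lazy"; Stanford and Senna produce identical tags. One sentence says little about overall accuracy, so for a real comparison you would run the same check over a tagged corpus.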