
TextBlob: A Natural Language Processing Library


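TextBlob (https://textblob.readthedocs.io/en/dev/index.html) is a Python library for processing textual data. It provides a simple API for common NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, translation, and classification.

Official documentation: https://textblob.readthedocs.io/en/dev/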

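Contents

1. Sentiment analysis

2. Part-of-speech tagging

3. Tokenization into words and sentences

4. Noun phrases

5. Lemmatization and stemming

(1) Singular and plural forms

(2) The Word class

(3) WordNet: getting synonym sets (synsets)

6. Spelling correction

(1) Correcting text directly

(2) Spell-checking a Word

7. Word frequencies

(1) Word counts

(2) Noun phrase counts

8. Translation and language detection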


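1. Sentiment analysis

Sentiment is the opinion expressed in a sentence. Polarity measures how negative or positive the sentence is, and subjectivity measures how much it expresses personal opinion rather than factual information.

The sentiment property returns a named tuple, Sentiment(polarity, subjectivity):

polarity: a float in [-1.0, 1.0], where -1.0 is the most negative and 1.0 is the most positive.

subjectivity: a float in [0.0, 1.0], where 0.0 is fully objective and 1.0 is fully subjective.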

    from textblob import TextBlob

    text = "Textblob is amazingly simple to use. What great fun!"
    blob = TextBlob(text)  # create a TextBlob object
    result = blob.sentiment
    # Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
    polarity = blob.sentiment.polarity  # 0.39166666666666666
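
A polarity score is often turned into a coarse label before further use. The sketch below shows one possible way to do that. The helper names label_polarity and label_text and the 0.1 threshold are assumptions made for illustration; none of them is part of TextBlob.

```python
def label_polarity(polarity: float, threshold: float = 0.1) -> str:
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    The threshold is an arbitrary illustrative value, not a TextBlob default.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def label_text(text: str) -> str:
    """Label a piece of text using TextBlob's polarity score."""
    # Imported lazily so the pure helper above works without TextBlob installed.
    from textblob import TextBlob

    return label_polarity(TextBlob(text).sentiment.polarity)
```

With the polarity reported above (about 0.39), label_polarity returns "positive". A suitable threshold depends on the data and should be tuned rather than taken from this sketch.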

2. Part-of-speech tagging

    from textblob import TextBlob

    wiki = TextBlob("Python is a high-level, general-purpose programming language.")
    tag = wiki.tags  # list of (word, part-of-speech tag) tuples
    # [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'), ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]
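
Because the tags come back as plain (word, tag) pairs, they are easy to post-process with the standard library. The following sketch groups words by tag. The helper name group_by_tag is hypothetical, and the input list is copied from the output shown above so that the example runs without TextBlob.

```python
from collections import defaultdict


def group_by_tag(tags):
    """Group words by their part-of-speech tag, preserving word order."""
    groups = defaultdict(list)
    for word, tag in tags:
        groups[tag].append(word)
    return dict(groups)


# Output of wiki.tags from the example above, copied verbatim.
tags = [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
        ('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]

groups = group_by_tag(tags)
print(groups['JJ'])  # ['high-level', 'general-purpose']
print(groups['NN'])  # ['programming', 'language']
```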

3. Tokenization into words and sentences

    from textblob import TextBlob

    blob = TextBlob("Beautiful is better than ugly. "
                    "Explicit is better than implicit. "
                    "Simple is better than complex.")
    word = blob.words          # split into words
    sentence = blob.sentences  # split into sentences
    # word:
    # WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])
    # sentence:
    # [Sentence("Beautiful is better than ugly."),
    #  Sentence("Explicit is better than implicit."),
    #  Sentence("Simple is better than complex.")]

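4. Noun phrases

    # wiki is the TextBlob created in section 2
    noun_phrases = wiki.noun_phrases  # renamed from "list" to avoid shadowing the built-in
    # WordList(['python'])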

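5. Lemmatization and stemming

(1) Singular and plural forms

singularize() converts a noun to its singular form and pluralize() converts it to its plural form. Both are intended for nouns and take irregular singular and plural forms into account.

    from textblob import TextBlob

    sentence = TextBlob('Use 4 spaces per indentation level.')
    word = sentence.words
    danshu = word[2].singularize()  # singular form: 'space'
    fushu = word[-1].pluralize()    # plural form: 'levels'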

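(2) The Word class

The lemmatize() method reduces a word to its lemma: the singular form for nouns and the base form for verbs. It treats the word as a noun by default, so verbs need a separate call with the part of speech passed explicitly.

    from textblob import Word

    w1 = Word('apples')
    result1 = w1.lemmatize()     # treated as a noun by default: 'apple'
    w2 = Word('went')
    result2 = w2.lemmatize("v")  # lemmatized as a verb: 'go'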
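
Since lemmatize() needs to know the part of speech, one common pattern is to derive it from the tags produced by the tagger. The sketch below maps Penn Treebank tags to the single-letter WordNet codes ('n', 'v', 'a', 'r'). The helper names penn_to_wordnet and lemmatize_tagged are hypothetical, and the prefix-based mapping is a simplification: for example, it also sends the particle tag RP to the adverb code.

```python
def penn_to_wordnet(tag: str) -> str:
    """Approximate a Penn Treebank tag with a WordNet part-of-speech code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb (simplification: RP also lands here)
    return "n"      # fall back to noun, which is lemmatize()'s own default


def lemmatize_tagged(text: str) -> list:
    """Lemmatize every word in the text using its tagged part of speech."""
    # Imported lazily so penn_to_wordnet can be used without TextBlob installed.
    from textblob import TextBlob, Word

    return [Word(word).lemmatize(penn_to_wordnet(tag))
            for word, tag in TextBlob(text).tags]
```

Calling lemmatize_tagged requires TextBlob together with its tagger and WordNet corpora; the results depend on how the tagger labels each word.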

(3) WordNet: getting synonym sets (synsets)

    # 1. Get the synsets (sets of synonyms) of a word
    from textblob import Word
    from textblob.wordnet import VERB

    result1 = Word("hack").synsets
    # get_synsets(pos=VERB) returns only the synsets in which the word is a verb;
    # called without arguments, it returns the same result as synsets
    result2 = Word("hack").get_synsets(pos=VERB)
    # result1: [Synset('hack.n.01'), Synset('machine_politician.n.01'), Synset('hack.n.03'),
    #   Synset('hack.n.04'), Synset('cab.n.03'), Synset('hack.n.06'), Synset('hack.n.07'),
    #   Synset('hack.n.08'), Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'),
    #   Synset('hack.v.04'), Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]
    # result2: [Synset('chop.v.05'), Synset('hack.v.02'), Synset('hack.v.03'), Synset('hack.v.04'),
    #   Synset('hack.v.05'), Synset('hack.v.06'), Synset('hack.v.07'), Synset('hack.v.08')]

    # 2. Get the definition of a synset
    defi = result1[1].definition()
    # a politician who belongs to a small clique that controls a political party for private rather than public ends

    # 3. Get the definitions of the word itself
    defi = Word("octopus").definitions
    # ['tentacles of octopus prepared as food', 'bottom-living cephalopod having a soft oval body with eight long tentacles']

6. Spelling correction

(1) Correcting text directly

    from textblob import TextBlob

    b = TextBlob("I havv goood speling!")
    b_corr = b.correct()
    print(b_corr)  # I have good spelling!

(2) Spell-checking a Word

The Word.spellcheck() method returns a list of (word, confidence) tuples containing spelling suggestions.

    from textblob import Word

    w = Word('falibility')
    w_ = w.spellcheck()
    print(w_)  # [('fallibility', 1.0)]
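
The list of (word, confidence) tuples can be filtered before a suggestion is accepted. The following sketch picks the highest-confidence candidate and rejects it if the confidence is too low. The helper name best_suggestion and the 0.5 cutoff are assumptions for illustration, and the input is copied from the output shown above.

```python
def best_suggestion(suggestions, min_confidence=0.5):
    """Return the highest-confidence suggestion, or None if none is good enough.

    `suggestions` is a list of (word, confidence) tuples, in the shape
    returned by Word.spellcheck(). The cutoff is an illustrative value.
    """
    if not suggestions:
        return None
    word, confidence = max(suggestions, key=lambda pair: pair[1])
    return word if confidence >= min_confidence else None


print(best_suggestion([('fallibility', 1.0)]))  # fallibility
```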

7. Word frequencies

(1) Word counts

    from textblob import TextBlob

    monty = TextBlob("We are no longer the Knights who say Ni. "
                     "We are now the Knights who say Ekki ekki ekki PTANG.")
    # Option 1: the word_counts dictionary (case-insensitive)
    counts = monty.word_counts['ekki']
    print(counts)  # 3
    # Option 2: the count() method (case-insensitive by default)
    counts2 = monty.words.count('ekki')
    print(counts2)  # 3
    # Option 3: count() with case sensitivity enabled
    counts3 = monty.words.count('ekki', case_sensitive=True)
    print(counts3)  # 2

(2) Noun phrase counts

    # wiki is the TextBlob created in section 2
    counts4 = wiki.noun_phrases.count('python')  # noun phrase frequency
    print(counts4)  # 1

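8. Translation and language detection

    # Note: translate() relies on the Google Translate web API. It was deprecated
    # in TextBlob 0.16.0 and removed in 0.18.0, so this example needs an older release.
    from textblob import TextBlob

    en_blob = TextBlob('Simple is better than complex.')
    lang = en_blob.translate(to='es')  # the source language is detected automatically when from_lang is omitted
    print(lang)
    # Simple es mejor que complejo.
    chinese_blob = TextBlob("美丽优于丑陋")
    lang = chinese_blob.translate(from_lang="zh-CN", to='en')
    print(lang)
    # Beautiful is better than ugly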

 
