
Python Text Analysis (Intensive Reading Notes 1)


I. Text Tokenization

1. Sentence tokenization: the process of splitting a text corpus into sentences.

Sentence tokenization here uses the NLTK framework, which provides several interfaces for the task: sent_tokenize, PunktSentenceTokenizer, RegexpTokenizer, and pretrained sentence-tokenization models (a sketch of the regex-based approach follows the example below).

  import nltk
  from pprint import pprint  # pprint works like print, but renders nested data structures more readably, one element per line
  sample_text='We will discuss briefly about the basic syntax,structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'
  # Method 1
  sample_sentences=nltk.sent_tokenize(text=sample_text)
  # Method 2
  punkt_st=nltk.tokenize.PunktSentenceTokenizer()
  sample_sentences=punkt_st.tokenize(sample_text)
  pprint(sample_sentences)
  >>>>
  ['We will discuss briefly about the basic syntax,structure and design '
   'philosophies.',
   'There is a defined hierarchical syntax for Python code which you should '
   'remember when writing code!',
   'Python is a really powerful programming language!']
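
The RegexpTokenizer interface mentioned above is not shown in the notes; here is a minimal sketch, continuing from the snippet above. The gap pattern is my own illustration (split on whitespace that follows sentence-ending punctuation), not the book's exact pattern:

  # Method 3 (sketch): with gaps=True the pattern describes the separators between sentences
  sentence_pattern=r'(?<=[.!?])\s+'
  regex_st=nltk.tokenize.RegexpTokenizer(pattern=sentence_pattern,gaps=True)
  sample_sentences=regex_st.tokenize(sample_text)
  pprint(sample_sentences)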

Note: when using the sentence tokenizers above I hit a problem with nltk.download('punkt'). I followed a CSDN post on fixing nltk.download() failures, which solved it; if you run into something similar it is worth a look: https://blog.csdn.net/lcf0000/article/details/121849782?utm_medium=distribute.pc_aggpage_search_result.none-task-blog-2~aggregatepage~first_rank_ecpm_v1~rank_v31_ecpm-4-121849782.pc_agg_new_rank&utm_term=nltk.download%28punkt%29&spm=1000.2123.3001.4430
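
Independently of that fix, the NLTK data packages these notes rely on can be fetched explicitly once in an interpreter session (standard NLTK downloads; mirror/proxy workarounds are covered in the linked post):

  import nltk
  nltk.download('punkt')      # sentence/word tokenizer models used in this section
  nltk.download('stopwords')  # used in the stopword-removal section below
  nltk.download('wordnet')    # used by the repeated-character check and the lemmatizer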

2. Word tokenization: the process of splitting a sentence into its constituent words.

Still within the NLTK framework, the main interfaces are word_tokenize, TreebankWordTokenizer, RegexpTokenizer, and the tokenizers that inherit from RegexpTokenizer (a RegexpTokenizer sketch follows the example below).

  import nltk
  sentence="The brown fox wasn't that quick and he couldn't win the race"
  words=nltk.word_tokenize(sentence)
  print(words)
  treebank_wk=nltk.TreebankWordTokenizer()
  words=treebank_wk.tokenize(sentence)
  print(words)
  >>>>
  ['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
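
The RegexpTokenizer mentioned above is pattern-driven; a minimal sketch follows (the \w+ pattern is a common choice of my own for illustration, and WordPunctTokenizer / WhitespaceTokenizer are the NLTK tokenizers derived from RegexpTokenizer):

  from nltk.tokenize import RegexpTokenizer, WordPunctTokenizer, WhitespaceTokenizer
  # pattern=r'\w+' keeps runs of word characters and drops punctuation
  regex_wt=RegexpTokenizer(pattern=r'\w+',gaps=False)
  print(regex_wt.tokenize(sentence))
  # subclasses of RegexpTokenizer with fixed built-in patterns
  print(WordPunctTokenizer().tokenize(sentence))
  print(WhitespaceTokenizer().tokenize(sentence))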

II. Text Normalization

The following code loads the basic dependencies and the corpus that will be used:

  import nltk
  import re
  import string
  from pprint import pprint
  corpus=["The brown fox wasn't that quick and couldn't win the race","Hey that's a great deal! I just bought a phone for $199", "@@You'll(learn) a **lot** in the book . Python is amazing language!@@"]

1. Text cleaning: removing irrelevant and unnecessary markup and characters (a minimal sketch is shown below).
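
The notes give no code for this step, so here is a minimal sketch, assuming HTML-like tags and extra whitespace are the noise to strip; the function name and regexes are my own illustration:

  def clean_text(text):
      text=re.sub(r'<[^>]+>',' ',text)  # drop HTML-like tags
      text=re.sub(r'\s+',' ',text)      # collapse runs of whitespace
      return text.strip()
  print(clean_text("<p>Hey   that's a <b>great</b> deal!</p>"))
  # Hey that's a great deal!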

2. Text tokenization:

  import nltk
  import re
  import string
  from pprint import pprint
  corpus=["The brown fox wasn't that quick and couldn't win the race","Hey that's a great deal! I just bought a phone for $199", "@@You'll(learn) a **lot** in the book . Python is amazing language!@@"]
  # text tokenization: split each document into sentences, then each sentence into words
  def tokenize_text(text):
      sentences=nltk.sent_tokenize(text)
      word_tokens=[nltk.word_tokenize(sentence) for sentence in sentences]
      return word_tokens
  token_list=[tokenize_text(text) for text in corpus]
  pprint(token_list)
  >>>>
  [[['The',
     'brown',
     'fox',
     'was',
     "n't",
     'that',
     'quick',
     'and',
     'could',
     "n't",
     'win',
     'the',
     'race']],
   [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
    ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
   [['@',
     '@',
     'You',
     "'ll",
     '(',
     'learn',
     ')',
     'a',
     '*',
     '*',
     'lot',
     '*',
     '*',
     'in',
     'the',
     'book',
     '.'],
    ['Python', 'is', 'amazing', 'language', '!'],
    ['@', '@']]]

3. Removing special characters

  def remove_characters_after_tokenization(tokens):
      # pattern matching any punctuation character; strip it from each token
      pattern=re.compile('[{}]'.format(re.escape(string.punctuation)))
      filtered_tokens=list(filter(None,[pattern.sub('',token) for token in tokens]))
      return filtered_tokens
  # I could not get usable output from the book's code below
  filtered_list_1=[filter(None,[remove_characters_after_tokenization(tokens) for tokens in sentence_tokens]) for sentence_tokens in token_list]
  print(filtered_list_1)
  # This is my modified version
  sentence_list=[]
  for sentence_tokens in token_list:
      for tokens in sentence_tokens:
          print(tokens)
          sentence_list.append(remove_characters_after_tokenization(tokens))
  >>>>
  # the results no longer contain special characters
  [['The',
    'brown',
    'fox',
    'was',
    'nt',
    'that',
    'quick',
    'and',
    'could',
    'nt',
    'win',
    'the',
    'race'],
   ['Hey', 'that', 's', 'a', 'great', 'deal'],
   ['I', 'just', 'bought', 'a', 'phone', 'for', '199'],
   ['You', 'll', 'learn', 'a', 'lot', 'in', 'the', 'book'],
   ['Python', 'is', 'amazing', 'language'],
   []]
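
A likely reason the book's one-liner did not produce usable output here: under Python 3, filter() returns a lazy iterator, so printing the outer list shows <filter object ...> entries instead of tokens. Wrapping each filter in list() keeps the book's structure intact (a small sketch using the same names as above):

  filtered_list_1=[list(filter(None,[remove_characters_after_tokenization(tokens) for tokens in sentence_tokens])) for sentence_tokens in token_list]
  print(filtered_list_1)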

4. Expanding contractions

  import contractions
  from contractions import CONTRACTION_MAP
  def expand_contractions(sentence,contraction_mapping):
      contractions_pattern=re.compile('({})'.format('|'.join(contraction_mapping.keys())),flags=re.IGNORECASE|re.DOTALL)
      def expand_match(contraction):
          match=contraction.group(0)
          first_char=match[0]
          expanded_contraction=contraction_mapping.get(match)\
                               if contraction_mapping.get(match)\
                               else contraction_mapping.get(match.lower())
          # keep the original casing of the first character
          expanded_contraction=first_char+expanded_contraction[1:]
          return expanded_contraction
      expanded_sentence=contractions_pattern.sub(expand_match,sentence)
      return expanded_sentence
  # expand_contractions works on raw sentence strings (the apostrophes must still be present),
  # so apply it to the original corpus rather than the already-stripped token lists
  expanded_corpus=[expand_contractions(sentence,CONTRACTION_MAP) for sentence in corpus]
  print(expanded_corpus)
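
CONTRACTION_MAP here is supplied by the contractions.py helper module that accompanies the book, not by the pip "contractions" package. If you do not have that file, a small stand-in dictionary is enough to try the function; this sample map and its entries are my own illustration, not the full mapping:

  CONTRACTION_MAP={
      "wasn't": "was not",
      "couldn't": "could not",
      "that's": "that is",
      "you'll": "you will",
  }
  print(expand_contractions(corpus[0],CONTRACTION_MAP))
  # The brown fox was not that quick and could not win the race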

5. Case conversion

  print(corpus[0].lower())
  print(corpus[0].upper())
  >>>>
  the brown fox wasn't that quick and couldn't win the race
  THE BROWN FOX WASN'T THAT QUICK AND COULDN'T WIN THE RACE

6. Removing stopwords (words that carry little or no meaning)

  def remove_stopwords(tokens):
      stopword_list=nltk.corpus.stopwords.words('english')
      filtered_tokens=[token.lower() for token in tokens if token.lower() not in stopword_list]
      return filtered_tokens
  # first split the corpus with the tokenize_text function defined earlier
  corpus_tokens=[tokenize_text(text) for text in corpus]
  filtered_list_3=[[remove_stopwords(tokens) for tokens in sentence_tokens] for sentence_tokens in corpus_tokens]
  >>>> compare the results below
  stopword_list  # all entries are lowercase
  Out[69]:
  ['i',
   'me',
   'my',
   'myself',
   'we',
   'our',
   'ours',
   'ourselves',
   'you',
   "you're",
   "you've",
   "you'll",
   ...]
  corpus_tokens
  Out[68]:
  [[['The',
     'brown',
     'fox',
     'was',
     "n't",
     'that',
     'quick',
     'and',
     'could',
     "n't",
     'win',
     'the',
     'race']],
   [['Hey', 'that', "'s", 'a', 'great', 'deal', '!'],
    ['I', 'just', 'bought', 'a', 'phone', 'for', '$', '199']],
   [['@',
     '@',
     'You',
     "'ll",
     '(',
     'learn',
     ')',
     'a',
     '*',
     '*',
     'lot',
     '*',
     '*',
     'in',
     'the',
     'book',
     '.'],
    ['Python', 'is', 'amazing', 'language', '!'],
    ['@', '@']]]
  filtered_list_3
  Out[67]:
  [[['brown', 'fox', "n't", 'quick', 'could', "n't", 'win', 'race']],
   [['hey', "'s", 'great', 'deal', '!'], ['bought', 'phone', '$', '199']],
   [['@', '@', "'ll", '(', 'learn', ')', '*', '*', 'lot', '*', '*', 'book', '.'],
    ['python', 'amazing', 'language', '!'],
    ['@', '@']]]

7. Correcting repeated characters

Informal English text often contains words with repeated characters.

  # correcting repeated characters
  import nltk
  import re
  from nltk.corpus import wordnet
  def remove_repeated_characters(tokens):
      # pattern that captures a character repeated back-to-back inside a word
      repeat_pattern=re.compile(r'(\w*)(\w)\2(\w*)')
      # substitution that drops one occurrence of the repeated character
      match_substitution=r'\1\2\3'
      def replace(old_word):
          # if the word already exists in WordNet, keep it as is
          if wordnet.synsets(old_word):
              return old_word
          new_word=repeat_pattern.sub(match_substitution,old_word)
          # keep stripping one repetition at a time until nothing changes
          return replace(new_word) if new_word!=old_word else new_word
      correct_tokens=[replace(word) for word in tokens]
      return correct_tokens
  sample_sentences="My school is reallllly amaaazningggg"
  sample_sentence=tokenize_text(sample_sentences)
  print(remove_repeated_characters(sample_sentence[0]))
  >>>>
  sample_sentence
  Out[24]: [['My', 'school', 'is', 'reallllly', 'amaaazningggg']]
  print(remove_repeated_characters(sample_sentence[0]))
  ['My', 'school', 'is', 'really', 'amazning']

8. Stemming

A stem is the base form of a word; new words are created by attaching affixes to a stem, and a stem is not necessarily a valid dictionary word. NLTK provides several implementations, each with a different algorithm: PorterStemmer, LancasterStemmer, RegexpStemmer, and SnowballStemmer (the latter three are sketched after the Porter example below).

  # stemming with PorterStemmer
  from nltk.stem import PorterStemmer
  ps=PorterStemmer()
  print(ps.stem('jumping'),ps.stem('jumps'),ps.stem('jumped'),ps.stem('lying'),ps.stem('strange'))
  >>>>
  jump jump jump lie strang
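
The other stemmers listed above are used the same way; a minimal sketch (outputs omitted; Lancaster in particular is more aggressive and often truncates harder than Porter):

  from nltk.stem import LancasterStemmer, RegexpStemmer, SnowballStemmer
  ls=LancasterStemmer()                   # aggressive rule-based stemmer
  print(ls.stem('jumping'),ls.stem('lying'))
  rs=RegexpStemmer('ing$|s$|ed$',min=4)   # strips only the suffixes you list, for words of at least min characters
  print(rs.stem('jumping'),rs.stem('lying'))
  ss=SnowballStemmer('english')           # supports several languages; 'english' selects the English variant
  print(ss.stem('jumping'),ss.stem('lying'))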

9. Lemmatization

  # lemmatization: the lemma is always a valid dictionary word
  from nltk.stem import WordNetLemmatizer
  wnl=WordNetLemmatizer()
  print(wnl.lemmatize('cars','n'))
  print(wnl.lemmatize('running','v'))
  >>>>
  car
  run
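
The second argument is the part-of-speech tag, and it matters: lemmatize() treats the word as a noun by default, so verbs and adjectives are only reduced correctly when the right tag is passed. A small sketch:

  print(wnl.lemmatize('running'))       # no pos given -> treated as a noun, stays 'running'
  print(wnl.lemmatize('running','v'))   # as a verb -> 'run'
  print(wnl.lemmatize('fancier','a'))   # as an adjective -> 'fancy'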

That covers cleaning, normalizing, and standardizing text.
