赞
踩
自然语言处理NLP
是人工智能和计算机科学的一个子领域,专注于计算机与人类(自然)语言之间的互动。其目的是使计算机能够理解、解释和生成人类语言。NLP 涉及语言学、计算机科学和人工智能的多学科交叉,通过统计、机器学习和深度学习等方法处理和分析大量的自然语言数据。
NLP 包括多种任务和应用,主要分为以下几类:
分词:将文本分割成独立的单词或短语。
词性标注:标识每个单词在句子中的词性(如名词、动词、形容词等)。
句法分析:解析句子的语法结构,包括依存关系和短语结构分析。
情感分析:检测文本中的情感倾向,如积极、中立或消极。
主题建模:识别文本中的主题或隐藏的语义结构。
垃圾邮件过滤:检测并过滤电子邮件中的垃圾内容。
命名实体识别(NER):识别文本中的实体,如人名、地名、组织等。
关系抽取:从文本中提取实体之间的关系。
事件抽取:识别和分类文本中的事件信息。
自动翻译:将文本从一种语言翻译成另一种语言,如 Google 翻译。
语言生成:生成与输入语义相关的自然语言文本。
摘要生成:从长文本中提取关键内容生成摘要。
聊天机器人:与用户进行自然语言对话,如客服机器人。
智能助理:提供信息查询、任务管理等服务,如 Siri、Alexa。
早期的 NLP 方法主要依赖于统计模型,如 n-gram 模型、隐马尔可夫模型(HMM)和条件随机场(CRF),用于各种语言处理任务。
传统机器学习方法,如支持向量机(SVM)、朴素贝叶斯、决策树等被广泛应用于文本分类、情感分析等任务。
近年来,深度学习技术在 NLP 中取得了显著进展,主要包括:
循环神经网络(RNN):特别是长短期记忆(LSTM)和门控循环单元(GRU)被用于处理序列数据,如文本生成和机器翻译。
卷积神经网络(CNN):用于文本分类和句子建模。
Transformer:由 Google 提出的 Transformer 结构及其衍生模型(如 BERT、GPT)在多种 NLP 任务中表现优异。
NLTK:Python 的自然语言处理库,提供丰富的工具和资源。
spaCy:高效的 NLP 库,适用于工业应用。
Gensim:用于主题建模和文档相似性计算的库。
Transformers:Hugging Face 提供的库,包含多种预训练模型,如 BERT、GPT 等。
NLTK
(Natural Language Toolkit)是一个广泛使用的开源 Python 库,专门用于处理自然语言文本。它提供了丰富的工具和资源,用于完成各种自然语言处理(NLP)任务,包括文本预处理、词性标注、句法分析、语义分析、机器翻译等。NLTK 适用于教育和研究领域,同时也是入门 NLP 的理想工具。
NLTK 包含多个模块和子包,提供了各种 NLP 功能。以下是一些核心组件和功能:
分词(Tokenization):将文本分割成独立的单词或句子。
- # 导入 NLTK 库
- import nltk
-
- # 下载 punkt 数据包,用于分句和分词
- nltk.download('punkt')
-
- # 定义一个句子
- sentence = "Natural language processing is fun."
-
- # 使用 NLTK 的 word_tokenize 函数对句子进行分词
- # word_tokenize 函数将输入的字符串按单词进行分割,生成一个单词列表
- words = nltk.word_tokenize(sentence)
-
- # 打印分词后的结果
- # 结果是一个包含句子中每个单词的列表
- print(words)
-
- # 输出:
- '''
- ['Natural', 'language', 'processing', 'is', 'fun', '.']
- '''
去除停用词(Stopword Removal):去除无意义的常见词(如 "the", "is")。
- # 从 NLTK 的语料库模块中导入 stopwords
- from nltk.corpus import stopwords
-
- # 下载 stopwords 数据包,包含各种语言的常见停用词
- nltk.download('stopwords')
-
- # 获取英语的停用词集合
- stop_words = set(stopwords.words('english'))
-
- # 过滤掉分词结果中的停用词
- # 对于每个单词 w,如果该单词(转换为小写后)不在停用词集合中,则保留该单词
- filtered_words = [w for w in words if not w.lower() in stop_words]
-
- # 打印过滤后的单词列表
- # 结果是一个去除了停用词的单词列表
- print(filtered_words)
-
- # 输出:
- '''
- ['Natural', 'language', 'processing', 'fun', '.']
- '''
词干提取(Stemming):将单词还原为词干形式。
- # 从 NLTK 的 stem 模块中导入 PorterStemmer
- from nltk.stem import PorterStemmer
-
- # 初始化 PorterStemmer 对象
- stemmer = PorterStemmer()
-
- # 对分词结果中的每个单词进行词干提取
- # stemmer.stem(w) 方法会提取单词的词干
- stemmed_words = [stemmer.stem(w) for w in words]
-
- # 打印词干提取后的单词列表
- # 结果是一个包含每个单词词干形式的列表
- print(stemmed_words)
-
- # 输出:
- '''
- ['natur', 'languag', 'process', 'is', 'fun', '.']
- '''
词形还原(Lemmatization):将单词还原为其基本形式。
- # 从 NLTK 的 stem 模块中导入 WordNetLemmatizer
- from nltk.stem import WordNetLemmatizer
-
- # 下载 wordnet 数据包,包含用于词形还原的词典
- nltk.download('wordnet')
-
- # 初始化 WordNetLemmatizer 对象
- lemmatizer = WordNetLemmatizer()
-
- # 对分词结果中的每个单词进行词形还原
- # lemmatizer.lemmatize(w) 方法会将单词还原为其基本形式
- lemmatized_words = [lemmatizer.lemmatize(w) for w in words]
-
- # 打印词形还原后的单词列表
- # 结果是一个包含每个单词基本形式的列表
- print(lemmatized_words)
-
- # 输出:
- '''
- ['Natural', 'language', 'processing', 'is', 'fun', '.']
- '''
标注句子中的每个单词的词性:
- # 下载词性标注器的模型数据包
- nltk.download('averaged_perceptron_tagger')
-
- # 对分词结果进行词性标注
- # nltk.pos_tag(words) 方法会为每个单词分配词性标签
- tagged_words = nltk.pos_tag(words)
-
- # 打印词性标注后的单词列表
- # 结果是一个包含单词及其词性标签的元组列表
- print(tagged_words)
-
- # 输出:
- '''
- [('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('fun', 'NN'), ('.', '.')]
- '''
识别句子中的命名实体:
- # 下载用于命名实体识别的模型数据包
- nltk.download('maxent_ne_chunker')
-
- # 下载 words 数据包,包含用于命名实体识别的词典
- nltk.download('words')
-
- # 使用词性标注后的单词列表进行命名实体识别
- # nltk.chunk.ne_chunk(tagged_words) 方法会识别句子中的命名实体
- entities = nltk.chunk.ne_chunk(tagged_words)
-
- # 打印命名实体识别后的结果
- # 结果是一个包含标记的命名实体的树结构
- print(entities)
-
- # 输出:
- '''
- (S Natural/JJ language/NN processing/NN is/VBZ fun/NN ./.)
- '''
解析句子的语法结构:
- # 从 NLTK 导入上下文无关文法(CFG)模块
- from nltk import CFG
-
- # 定义上下文无关文法(CFG)
- grammar = CFG.fromstring("""
- S -> NP VP # 句子 S 由名词短语 NP 和动词短语 VP 组成
- VP -> V NP # 动词短语 VP 由动词 V 和名词短语 NP 组成
- NP -> 'John' | 'Mary' | 'Bob' # 名词短语 NP 由三个名字中的一个组成
- V -> 'loves' | 'hates' # 动词 V 包含 'loves' 和 'hates'
- """)
-
- # 使用定义的语法创建一个 ChartParser
- parser = nltk.ChartParser(grammar)
-
- # 将句子分词
- sentence = "John loves Mary".split()
-
- # 使用解析器解析句子
- for tree in parser.parse(sentence):
- # 打印解析得到的树结构
- print(tree)
-
- # 输出:
- '''
- (S (NP John) (VP (V loves) (NP Mary)))
- '''
NLTK 提供了丰富的语料库和词典资源,涵盖了各种语言和应用领域。主要语料库包括:Gutenberg Corpus(经典文学作品,如《白鲸》)、Brown Corpus(平衡的英语语料库)、Reuters Corpus(新闻文档)、Inaugural Address Corpus(美国总统就职演说)、Movie Reviews Corpus(影评文本)、Web Text Corpus(互联网文本)、Shakespeare Corpus(莎士比亚戏剧文本)和 Treebank Corpus(句法树和词性标注的文本)。主要词典资源包括:WordNet(大型英语词典数据库)、Names Corpus(常见男性和女性名字)、Stopwords Corpus(多种语言的停用词列表)、Swadesh Corpus(基本词汇列表)、CMU Pronouncing Dictionary(英语发音词典)和 Opinion Lexicon(正面和负面情感词汇列表)。这些资源为各种自然语言处理任务提供了基础数据支持。:
- # 从 NLTK 的语料库模块中导入 gutenberg 语料库
- from nltk.corpus import gutenberg
-
- # 下载 gutenberg 语料库
- nltk.download('gutenberg')
-
- # 获取 'melville-moby_dick.txt' 文件的原始文本内容
- sample = gutenberg.raw('melville-moby_dick.txt')
-
- # 打印前 500 个字符的文本内容
- print(sample[:500])
-
- # 输出是从 NLTK 的 Gutenberg 语料库中提取并打印了《白鲸》(Moby Dick)前 500 个字符的内容。:
- '''
- [Moby Dick by Herman Melville 1851]
- ETYMOLOGY.
- (Supplied by a Late Consumptive Usher to a Grammar School)
- The pale Usher--threadbare in coat, heart, body, and brain; I see him
- now. He was ever dusting his old lexicons and grammars, with a queer
- handkerchief, mockingly embellished with all the gay flags of all the
- known nations of the world. He loved to dust his old grammars; it
- somehow mildly reminded him of his mortality.
- "While you take in hand to school others, and to teac
- '''
以下是一个完整的示例,展示了如何使用 NLTK 进行文本预处理和词性标注:
- import nltk
- import re
- from nltk.corpus import stopwords
- from nltk.stem import PorterStemmer, WordNetLemmatizer
-
- # 下载必要的 NLTK 数据包
- nltk.download('punkt')
- nltk.download('stopwords')
- nltk.download('averaged_perceptron_tagger')
- nltk.download('wordnet')
-
- # 加载文本
- text = "NLTK is a leading platform for building Python programs to work with human language data."
-
- # 分词
- words = nltk.word_tokenize(text)
-
- # 去除停用词
- stop_words = set(stopwords.words('english'))
- filtered_words = [w for w in words if not w.lower() in stop_words]
-
- # 词干提取
- stemmer = PorterStemmer()
- stemmed_words = [stemmer.stem(w) for w in filtered_words]
-
- # 词形还原
- lemmatizer = WordNetLemmatizer()
- lemmatized_words = [lemmatizer.lemmatize(w) for w in filtered_words]
-
- # 词性标注
- tagged_words = nltk.pos_tag(lemmatized_words)
-
- print("分词:", words)
- print("去除停用词:", filtered_words)
- print("词干提取:", stemmed_words)
- print("词形还原:", lemmatized_words)
- print("词性标注:", tagged_words)
- import nltk
- from nltk.book import * # 导入 NLTK 书中的所有内容
- import matplotlib.pyplot as plt
-
- # 下载所需的资源包
- nltk.download('genesis')
- nltk.download('book')
- nltk.download('inaugural')
- nltk.download('punkt')
- nltk.download('averaged_perceptron_tagger')
- nltk.download('maxent_ne_chunker')
- nltk.download('words')
- nltk.download('stopwords')
- nltk.download('wordnet')
-
- # 搜索文本
- text1.concordance("monstrous")
- text2.concordance("affection")
- text3.concordance("lived")
- text5.concordance("lol") # 聊天记录
-
- # 搜索相似词
- text1.similar("monstrous")
- print("-----分割线----")
- text2.similar("monstrous") # 不同文本中的相似词
-
- # 搜索共同上下文
- text2.common_contexts(["monstrous","very"])
-
- # 词汇分布图
- text4.dispersion_plot(["citizens","democracy","freedom","duties","America"]) # 救济演说语料
- # 随着时间发展,单词出现的频率
-
- # 使用 text1 生成文本(需要 text1 已经导入)
- generated_text = text1.generate(50) # 生成 50 个单词的文本
- print(generated_text)
-
- # 计数词汇
- len(text3)
- print(len(text3))
- sorted(set(text3)) # 排序
- len(set(text3)) # 去重
-
- #重复词密度
- print(len(text3)/len(set(text3))) # 平均每个标识符出现16次
-
- # 关键词密度
- print(text3.count("smote"))
- print(100* text4.count("a")/len(text4))
- def lexical_diversity(text):
- return len(text)/len(set(text))
- def percentage(count,total):
- return 100*count/total
- print(lexical_diversity(text3))
- print("text5 diversity is {}".format(lexical_diversity(text5)))
- print("in text4,a percentage {}%".format(percentage(text4.count("a"),len(text4))))
-
-
- # 创建词汇频率分布
- fdist1 = FreqDist(text1)
-
- # 打印词汇频率分布的摘要
- print(fdist1)
-
- # 获取词汇表
- vocabulary1 = fdist1.keys()
-
- # 可视化频率分布
- plt.figure(figsize=(10, 6))
- fdist1.plot(30, cumulative=False) # 绘制前30个词汇的频率分布
- plt.show()
-
- # 输出:
- '''
- Displaying 11 of 11 matches:
- ong the former , one was of a most monstrous size . ... This came towards us ,
- ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
- ll over with a heathenish array of monstrous clubs and spears . Some were thick
- d as you gazed , and wondered what monstrous cannibal and savage could ever hav
- that has survived the flood ; most monstrous and most mountainous ! That Himmal
- they might scout at Moby Dick as a monstrous fable , or still worse and more de
- th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
- ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
- ere to enter upon those still more monstrous stories of them which are to be fo
- ght have been rummaged out of this monstrous cabinet there is no telling . But
- of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
- Displaying 25 of 79 matches:
- , however , and , as a mark of his affection for the three girls , he left them
- t . It was very well known that no affection was ever supposed to exist between
- deration of politeness or maternal affection on the side of the former , the tw
- d the suspicion -- the hope of his affection for me may warrant , without impru
- hich forbade the indulgence of his affection . She knew that his mother neither
- rd she gave one with still greater affection . Though her late conversation wit
- can never hope to feel or inspire affection again , and if her home be uncomfo
- m of the sense , elegance , mutual affection , and domestic comfort of the fami
- , and which recommended him to her affection beyond every thing else . His soci
- ween the parties might forward the affection of Mr . Willoughby , an equally st
- the most pointed assurance of her affection . Elinor could not be surprised at
- he natural consequence of a strong affection in a young and ardent mind . This
- opinion . But by an appeal to her affection for her mother , by representing t
- every alteration of a place which affection had established as perfect with hi
- e will always have one claim of my affection , which no other can possibly shar
- f the evening declared at once his affection and happiness . " Shall we see you
- ause he took leave of us with less affection than his usual behaviour has shewn
- ness ." " I want no proof of their affection ," said Elinor ; " but of their en
- onths , without telling her of his affection ;-- that they should part without
- ould be the natural result of your affection for her . She used to be all unres
- distinguished Elinor by no mark of affection . Marianne saw and listened with i
- th no inclination for expense , no affection for strangers , no profession , an
- till distinguished her by the same affection which once she had felt no doubt o
- al of her confidence in Edward ' s affection , to the remembrance of every mark
- was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
- Displaying 25 of 38 matches:
- ay when they were created . And Adam lived an hundred and thirty years , and be
- ughters : And all the days that Adam lived were nine hundred and thirty yea and
- nd thirty yea and he died . And Seth lived an hundred and five years , and bega
- ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an
- welve years : and he died . And Enos lived ninety years , and begat Cainan : An
- years , and begat Cainan : And Enos lived after he begat Cainan eight hundred
- ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
- rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund
- years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
- s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a
- and five yea and he died . And Jared lived an hundred sixty and two years , and
- o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y
- and two yea and he died . And Enoch lived sixty and five years , and begat Met
- ; for God took him . And Methuselah lived an hundred eighty and seven years ,
- , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred
- nd nine yea and he died . And Lamech lived an hundred eighty and two years , an
- ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin
- naan shall be his servant . And Noah lived after the flood three hundred and fi
- xad two years after the flo And Shem lived after he begat Arphaxad five hundred
- at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa
- ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an
- begat sons and daughters . And Salah lived thirty years , and begat Eber : And
- y years , and begat Eber : And Salah lived after he begat Eber four hundred and
- begat sons and daughters . And Eber lived four and thirty years , and begat Pe
- y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
- Displaying 25 of 822 matches:
- ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w
- ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s
- a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha
- e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does
- 30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for
- didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it
- es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap
- ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE
- k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 '
- loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett
- ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I
- cooler by the minute what 'd I miss ? lol noo there too much work ! why not ??
- that mean I want you ? U6 hello room lol U83 and this .. has been the grammar
- the rule he 's in PM land now though lol ah ok i wont bug em then someone wann
- flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80
- ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265
- 082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class
- . i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any
- . whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo
- ??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads
- ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl
- ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim
- pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO
- is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic
- s . lmao no U115 Check my record . :) Lol lick em U7 U23 how old r u lol Way to
- true contemptible christian abundant few part mean careful puzzled
- mystifying passing curious loving wise doleful gamesome singular
- delightfully perilous fearless
- -----分割线----
- very so exceedingly heartily a as good great extremely remarkably
- sweet vast amazingly
- am_glad a_pretty a_lucky is_pretty be_glad
- long , from one to the top - mast , and no coffin and went out a sea
- captain -- this peaking of the whales . , so as to preserve all his
- might had in former years abounding with them , they toil with their
- lances , strange tales
- long , from one to the top - mast , and no coffin and went out a sea
- captain -- this peaking of the whales . , so as to preserve all his
- might had in former years abounding with them , they toil with their
- lances , strange tales
- 44764
- 16.050197203298673
- 5
- 1.457806031353621
- 16.050197203298673
- text5 diversity is 7.420046158918563
- in text4,a percentage 1.457806031353621%
- <FreqDist with 19317 samples and 260819 outcomes>
- <Figure size 640x480 with 1 Axes>
- <Figure size 1000x600 with 1 Axes>
- '''
可视化输出:
词汇分布图 显示了特定单词在文本中的出现位置,便于分析这些单词在不同部分的分布情况。
词频分布图 则展示了文本中高频词汇的出现次数,帮助识别和分析文本中的重要词汇和常用词汇的使用频率。
NLTK 是一个强大且灵活的自然语言处理工具包,适用于学术研究和教育。它提供了丰富的工具和资源,可以完成从文本预处理到高级语言分析的各种任务。通过结合使用 NLTK 的各种功能,我们可以构建复杂的自然语言处理应用程序,并深入理解语言数据。
以上内容总结自网络,如有帮助欢迎转发,我们下次再见!
Copyright © 2003-2013 www.wpsshop.cn 版权所有,并保留所有权利。