Python and Language Processing: Python Packages for Coreference Resolution

Author: 小惠珠哦 | 2024-08-06 06:03:48

NLTK

NLTK is a natural language processing library for Python. It can be downloaded for free from http://www.nltk.org/.

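After installation, run the following two lines to download the data it needs:

>>> import nltk
>>> nltk.download()
# Alternatively, download the data packages directly from the official site and unzip them into the corresponding folder.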

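After you run the commands above, a GUI window pops up; select the data packages you need and download them.
(On my macOS setup, calling download triggered a restart bug, so I downloaded the data packages I needed manually.)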

Let's start with one command:

from nltk.book import *  # load everything from the book module of nltk
# Output:
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

# After loading, type a variable name to look up the corresponding text:
>>> text1
<Text: Moby Dick by Herman Melville 1851>
# text1 is Moby Dick

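How can we inspect the contents of these texts?
Method 1: read through the text directly.
Method 2: use the code below:

# A concordance view shows every occurrence of a given word, together with some surrounding context
>>> text1.concordance("monstrous")
# Show the occurrences of the word monstrous in text1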
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u	
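To make the idea of a concordance concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `concordance`, the `width` parameter, and the sample sentence are all made up for illustration, and this version counts context in tokens rather than characters.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():  # this sketch matches case-insensitively
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample tokens, not taken from text1
tokens = "a most monstrous size and a monstrous fable".split()
for line in concordance(tokens, "monstrous"):
    print(line)
# a most [monstrous] size and a
# size and a [monstrous] fable
```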

The output above shows where "monstrous" occurs in the text. How can we find other words that appear in similar contexts?

>>> text1.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

Given two words, how can we find the contexts they share?

>>> text2.common_contexts(["monstrous", "very"])
a_pretty am_glad a_lucky is_pretty be_glad
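The output format `a_pretty` means "the word before" and "the word after". A rough sketch of that idea in plain Python follows; the helper `contexts` and the sample sentence are assumptions for illustration, not NLTK's actual code.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs around each occurrence of `word`."""
    return {
        (tokens[i - 1], tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == word
    }

# Hypothetical sample sentence, not taken from text2
tokens = "she is very pretty and he is monstrous pretty so i am very glad".split()

# Contexts shared by both words are the intersection of the two sets
shared = contexts(tokens, "monstrous") & contexts(tokens, "very")
print(sorted(f"{a}_{b}" for a, b in shared))  # ['is_pretty']
```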

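How can we see where in a text a particular word occurs? A dispersion plot shows this information: each stripe marks one occurrence of a word, and each row covers the whole text for one word.

>>> text4.dispersion_plot(['America', 'freedom'])
# matplotlib must be installed for this to run

The result is shown below:
[Figure: dispersion plot of 'America' and 'freedom' in text4]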
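The data behind such a plot is essentially the list of positions at which each word occurs. A minimal sketch, assuming a plain list of tokens (the helper `offsets` and the sample sentence are hypothetical):

```python
def offsets(tokens, word):
    """Token indices at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# Hypothetical sample tokens, not taken from text4
tokens = "freedom is never free ; America defends freedom".split()
for word in ["America", "freedom"]:
    print(word, offsets(tokens, word))
# America [5]
# freedom [0, 7]
```

Plotting each index as a vertical stripe on the row for its word gives the dispersion plot.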

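How do we measure the length of a text, counted in words and punctuation marks? Use len:

>>> len(text3)
44764
# This means text3 contains 44764 words and punctuation marks, also called "tokens".

A token is the technical name for a sequence of characters (such as hairy or his) that we treat as a unit.
For example, "to be or not to be" contains 6 tokens but only 4 distinct words.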
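The "to be or not to be" example can be checked directly in plain Python. This is only a sketch: splitting on whitespace stands in for real tokenization, and the variable names are mine.

```python
tokens = "to be or not to be".split()  # naive whitespace tokenization
distinct = set(tokens)                 # a set keeps each word once

print(len(tokens))    # 6
print(len(distinct))  # 4
print(sorted(distinct))  # ['be', 'not', 'or', 'to']
```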

How do we count the distinct words?

>>> len(set(text3))
2789
# set is a collection that keeps each item only once

How do we count how many times a particular word occurs in a text?

>>> text3.count('he')
648

Texts as Lists of Words

What is a text?
A text is nothing more than a sequence of words and punctuation marks.

For example:

sent1 = ['Call', 'me', 'Ishmael', '.']
sent2 = ['the', 'family', 'of', 'Dashwood' ,'had', 'long', 'been', 'settled', 'in' ,'Sussex', '.']

In the code above, sent1 and sent2 are two list variables. They support functions such as len() and sorted(), and they also support addition, which joins two lists into one.

>>> sent1 + sent2
['Call', 'me', 'Ishmael', '.', 'the', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
# This addition is called concatenation

How do we add an element to a list?
Use the append() or insert() methods.
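A short sketch of both methods on a small list (the list contents are just sample data):

```python
words = ['Call', 'Ishmael']

words.append('.')      # append adds the element at the end
print(words)           # ['Call', 'Ishmael', '.']

words.insert(1, 'me')  # insert places the element before index 1
print(words)           # ['Call', 'me', 'Ishmael', '.']
```

Both methods change the list in place and return None, so there is no need to assign the result back.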

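How do we look up the word at a given position?
Use indexing and slicing:

>>> text1[25:30]	# half-open range: start included, end excluded
['--', 'threadbare', 'in', 'coat', ',']
# These are the items at indices 25, 26, 27, 28, 29

>>> text1[124]
'the'
# The word at index 124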

How do we change the word at a given index?
Assign a new value to that index:

>>> sent1[1] = 'you'
# Change the element at index 1 to 'you'
>>> sent1
['Call', 'you', 'Ishmael', '.']

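In the examples above, anything enclosed in single quotes (' ') is called a string.
For example:

>>> s = 'hello world'

Strings support basic operations such as indexing, slicing, addition, and multiplication.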
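Here is a brief sketch of those four string operations; the variable names are only for illustration.

```python
s = 'hello world'

first = s[0]         # indexing: 'h'
word = s[0:5]        # slicing (end excluded): 'hello'
joined = s + '!'     # addition concatenates: 'hello world!'
repeated = 'ab' * 3  # multiplication repeats: 'ababab'

print(first, word, joined, repeated)
```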

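Simple Statistics

The problem we want to solve is how to process text with Python.

In the NLTK section we covered some of the most basic library functions; in the section on texts as lists we covered basic sequence and string concepts.

Here we look at some more practical applications.

How can we capture the style and character of a text? We can extract its 50 most frequent words.
The question then is: how do we build a frequency distribution over the words?

One approach is to keep a counter for every word. NLTK provides a built-in tool for this: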

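# For example, for text1 (FreqDist is already in scope after `from nltk.book import *`):
>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
# FreqDist builds a dictionary-like object that maps each word to the number of times it occurs

>>> FreqDist(text1).plot(50, cumulative=True)
# The result is shown in the figure below:
# a cumulative frequency plot of the 50 most common words in text1

[Figure: cumulative frequency plot of the 50 most common words in text1]

These are the high-frequency words.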
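The "one counter per word" idea can be sketched with the standard library's collections.Counter. This is a stand-in to show the principle, not how FreqDist is used above; the sample tokens are made up.

```python
from collections import Counter

# Hypothetical sample tokens, not taken from text1
tokens = "the whale and the sea and the ship".split()
freq = Counter(tokens)  # maps each word to its number of occurrences

print(freq.most_common(2))  # [('the', 3), ('and', 2)]

# Running total over the most common words, the quantity a cumulative plot shows
running = 0
for word, count in freq.most_common(2):
    running += count
    print(word, count, running)
# the 3 3
# and 2 5
```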

How do we see the words that occur only once?

>>> FreqDist(text1).hapaxes()
# The output is too long to show, because it contains every word that occurs only once

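How do we find words that satisfy a particular condition? For example, words longer than 17 characters:

>>> [w for w in set(text1) if len(w) > 17]
['uninterpenetratingly', 'characteristically']

# Convert text1 to a set, then use a list comprehension: for each word w in the set, check its length and skip it if it does not meet the condition.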

Length alone does not make a word informative about a text. To find more meaningful words, we can also require that they occur frequently:

>>> f = FreqDist(text1)
>>> [w for w in set(text1) if f[w] > 50 and len(w) > 6]
['present', 'Leviathan', 'whaling', 'peculiar', 'turning', 'towards', 'sometimes', 'Queequeg', 'nothing', 'together', 'whalemen', 'Starbuck', 'Nantucket', 'themselves', 'looking', 'Tashtego', 'CHAPTER', 'standing', 'something', 'captain', 'Captain', 'however', 'harpooneers', 'harpoon', 'thought', 'another', 'fishery', 'through', 'therefore', 'strange', 'perhaps', 'between', 'certain', 'morning', 'curious', 'without', 'instant', 'business', 'because', 'further', 'beneath', 'himself', 'whether', 'against', 'thousand', 'harpooneer', 'general']
# The words with frequency greater than 50 and length greater than 6 are listed above.
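The same two filters, words that occur once and words that are both frequent and long, can be tried on a tiny sample without NLTK. In this sketch collections.Counter stands in for FreqDist, and the tokens and thresholds are chosen only to suit the small sample.

```python
from collections import Counter

# Hypothetical sample tokens, not taken from text1
tokens = "whaling captain and whaling ship and whaling crew and harpooneer".split()
freq = Counter(tokens)

# Words that occur exactly once (what hapaxes() reports for a FreqDist)
hapaxes = sorted(w for w in freq if freq[w] == 1)
print(hapaxes)  # ['captain', 'crew', 'harpooneer', 'ship']

# Words that are both frequent and long; thresholds scaled down for the sample
frequent_long = sorted(w for w in set(tokens) if freq[w] > 2 and len(w) > 6)
print(frequent_long)  # ['whaling']
```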

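Decisions and Control

Control flow in Python is driven by relational operators (more precisely, by expressions that evaluate to True or False). The control statements include for, while, and if.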
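A small sketch that uses all three statements on a list of tokens; the variable names are only for illustration.

```python
words = ['Call', 'me', 'Ishmael', '.']

# for + if: keep the words longer than 3 characters
long_words = []
for w in words:
    if len(w) > 3:  # the relational operator > yields True or False
        long_words.append(w)
print(long_words)  # ['Call', 'Ishmael']

# while: advance until the first full stop is found
i = 0
while i < len(words) and words[i] != '.':
    i += 1
print(i)  # 3
```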

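Automatic Natural Language Understanding

Word sense disambiguation: working out which sense of a word is intended in a particular context.

For example: he served the dish.
In this sentence, however, both key words are ambiguous:
1. serve has three senses: help with food or drink; hold an office; put ball into play.
2. dish has three senses: plate; course of a meal; communications device.

Context analysis lets the computer work out that "he served the dish." must be about food.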
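One very simplified way to picture context analysis is to pick the pair of senses whose descriptions share the most words. The sketch below is a toy, not a real disambiguation algorithm or an NLTK API: the sense labels and keyword sets are hypothetical, loosely extended from the sense lists above so that the food-related senses share vocabulary.

```python
# Hypothetical keyword sets for each sense (assumptions made for this toy example)
SERVE = {
    "food": {"help", "food", "drink", "meal"},
    "office": {"hold", "office"},
    "sport": {"put", "ball", "play"},
}
DISH = {
    "plate": {"plate", "food"},
    "course": {"course", "meal", "food"},
    "device": {"communications", "device"},
}

def best_pair(senses_a, senses_b):
    """Return the (sense_a, sense_b) pair whose keyword sets overlap the most."""
    pairs = ((a, b) for a in senses_a for b in senses_b)
    return max(pairs, key=lambda p: len(senses_a[p[0]] & senses_b[p[1]]))

print(best_pair(SERVE, DISH))  # ('food', 'course')
```

With these keyword sets, the food sense of serve and the meal sense of dish win because they share two keywords, which matches the intuition that the sentence is about food.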

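Coreference resolution (anaphora resolution): working out who did what to whom, that is, detecting the subjects and objects of verbs.

For example:

1. The thieves stole the paintings. They were subsequently sold;
2. The thieves stole the paintings. They were subsequently caught;
3. The thieves stole the paintings. They were subsequently found;

In these sentences the antecedent of "they" differs depending on the verb: in sentence 1 (sold) the antecedent is paintings, and in sentence 2 (caught) it is thieves. In sentence 3 (found) either reading is possible.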

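Automatic language generation:

For example, given the input: Text: The thieves stole the paintings. They were subsequently sold;
Question: Who or what was sold?
The computer should then answer "paintings".

Machine translation:

Human-computer dialogue systems: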
