
NLP Data Preprocessing (torchtext and spacy)

A problem came up when installing torchtext; pip reported:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

torchvision 0.4.2 requires torch==1.3.1, but you'll have torch 1.8.0 which is incompatible.
Collecting dataclasses; python_version < "3.7"
Using cached dataclasses-0.8-py3-none-any.whl (19 kB)
Installing collected packages: dataclasses, torch, torchtext
Attempting uninstall: torch
Found existing installation: torch 1.3.1
Can't uninstall 'torch'. No files were found to uninstall.
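
The log pins down the conflict: torchvision 0.4.2 requires torch==1.3.1, while torchtext pulled in torch 1.8.0. One plausible way out (my own suggestion based on the version note from GitHub quoted later in this post, not a fix given in the original) is to install a mutually compatible set of versions:

pip install torch==1.8.0 torchtext==0.9.0
# torchvision must be upgraded as well so it stops pinning torch==1.3.1;
# torchvision 0.9.0 is the release paired with torch 1.8.0
pip install torchvision==0.9.0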

Initialization functions: torch.nn.init

# -*- coding: utf-8 -*-
"""
Created on 2019

@author: fancp
"""

import torch 
import torch.nn as nn

w = torch.empty(3,5)

#1. Uniform distribution - U(a, b)
#torch.nn.init.uniform_(tensor, a=0.0, b=1.0)
print(nn.init.uniform_(w))
# =============================================================================
# tensor([[0.9160, 0.1832, 0.5278, 0.5480, 0.6754],
#         [0.9509, 0.8325, 0.9149, 0.8192, 0.9950],
#         [0.4847, 0.4148, 0.8161, 0.0948, 0.3787]])
# =============================================================================

#2. Normal distribution - N(mean, std)
#torch.nn.init.normal_(tensor, mean=0.0, std=1.0)
print(nn.init.normal_(w))
# =============================================================================
# tensor([[ 0.4388,  0.3083, -0.6803, -1.1476, -0.6084],
#         [ 0.5148, -0.2876, -1.2222,  0.6990, -0.1595],
#         [-2.0834, -1.6288,  0.5057, -0.5754,  0.3052]])
# =============================================================================

#3. Constant - fixed value val
#torch.nn.init.constant_(tensor, val)
print(nn.init.constant_(w, 0.3))
# =============================================================================
# tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000],
#         [0.3000, 0.3000, 0.3000, 0.3000, 0.3000],
#         [0.3000, 0.3000, 0.3000, 0.3000, 0.3000]])
# =============================================================================

#4. All ones
#torch.nn.init.ones_(tensor)
print(nn.init.ones_(w))
# =============================================================================
# tensor([[1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.]])
# =============================================================================

#5. All zeros
#torch.nn.init.zeros_(tensor)
print(nn.init.zeros_(w))
# =============================================================================
# tensor([[0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.]])
# =============================================================================

#6. Ones on the diagonal, zeros elsewhere
#torch.nn.init.eye_(tensor)
print(nn.init.eye_(w))
# =============================================================================
# tensor([[1., 0., 0., 0., 0.],
#         [0., 1., 0., 0., 0.],
#         [0., 0., 1., 0., 0.]])
# =============================================================================

#7. Xavier uniform initialization
#torch.nn.init.xavier_uniform_(tensor, gain=1.0)
#From - Understanding the difficulty of training deep feedforward neural networks - Glorot & Bengio 2010
print(nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu')))
# =============================================================================
# tensor([[-0.1270,  0.3963,  0.9531, -0.2949,  0.8294],
#         [-0.9759, -0.6335,  0.9299, -1.0988, -0.1496],
#         [-0.7224,  0.2181, -1.1219,  0.8629, -0.8825]])
# =============================================================================

#8. Xavier normal initialization
#torch.nn.init.xavier_normal_(tensor, gain=1.0)
print(nn.init.xavier_normal_(w))
# =============================================================================
# tensor([[ 1.0463,  0.1275, -0.3752,  0.1858,  1.1008],
#         [-0.5560,  0.2837,  0.1000, -0.5835,  0.7886],
#         [-0.2417,  0.1763, -0.7495,  0.4677, -0.1185]])
# =============================================================================

#9. Kaiming uniform initialization
#torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
#From - Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - Kaiming He 2015
print(nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu'))
# =============================================================================
# tensor([[-0.7712,  0.9344,  0.8304,  0.2367,  0.0478],
#         [-0.6139, -0.3916, -0.0835,  0.5975,  0.1717],
#         [ 0.3197, -0.9825, -0.5380, -1.0033, -0.3701]])
# =============================================================================

#10. Kaiming normal initialization
#torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
print(nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu'))
# =============================================================================
# tensor([[-0.0210,  0.5532, -0.8647,  0.9813,  0.0466],
#         [ 0.7713, -1.0418,  0.7264,  0.5547,  0.7403],
#         [-0.8471, -1.7371,  1.3333,  0.0395,  1.0787]])
# =============================================================================

#11. (Semi-)orthogonal matrix
#torch.nn.init.orthogonal_(tensor, gain=1)
#From - Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - Saxe 2013
print(nn.init.orthogonal_(w))
# =============================================================================
# tensor([[-0.0346, -0.7607, -0.0428,  0.4771,  0.4366],
#         [-0.0412, -0.0836,  0.9847,  0.0703, -0.1293],
#         [-0.6639,  0.4551,  0.0731,  0.1674,  0.5646]])
# =============================================================================

#12. Sparse matrix
#torch.nn.init.sparse_(tensor, sparsity, std=0.01)
#From - Deep learning via Hessian-free optimization - Martens 2010
print(nn.init.sparse_(w, sparsity=0.1))
# =============================================================================
# tensor([[ 0.0000,  0.0000, -0.0077,  0.0000, -0.0046],
#         [ 0.0152,  0.0030,  0.0000, -0.0029,  0.0005],
#         [ 0.0199,  0.0132, -0.0088,  0.0060,  0.0000]])
# =============================================================================
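
In real models these functions are applied to a module's parameters rather than a loose tensor. A minimal sketch of my own (not part of the original script), reusing the nn import above together with Module.apply:

def init_weights(m):
    # Xavier-uniform weights and zero biases for every Linear layer
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
model.apply(init_weights)  # .apply() visits every submodule recursively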


First look at torchtext

I went through three classic examples:

A. The German-to-English translation tutorial on GitHub
B. A blog post: 【PyTorch】【torchtext (2)】Field explained in detail
C. A text-generation example on Bilibili

From B:
1. Field: think of it as the power tool of the pipeline: it can add special tokens to the data, pad everything to the same length, and define how the text is split into tokens.

TEXT = Field(sequential=True, lower=True, fix_length=10,tokenize=str.split,batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)

2. Hand this tool and the data together to Example, and the data comes out processed:

for text,label in zip(corpus,labels):
    example = Example.fromlist([text,label],fields=fields)
    examples.append(example)

3. Hand the processed data to BucketIterator.splits, which batches examples of similar length together to minimize padding:

(This snippet is not from the same project as the two steps above: it comes from the German-English translation example, while the steps above follow the blog walkthrough.)
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

Full source:

from torchtext.data import Field, Example  # torchtext.legacy.data on torchtext >= 0.9

# 1. The data
corpus = ["D'aww! He matches this background colour",
         "Yo bitch Ja Rule is more succesful then",
         "If you have a look back at the source"]
labels = [0,1,0]
# 2. Define the different Fields
TEXT = Field(sequential=True, lower=True, fix_length=10,tokenize=str.split,batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)
fields = [("comment", TEXT),("label",LABEL)]
# 3. Convert the data into a list of Example objects
examples = []
for text,label in zip(corpus,labels):
    example = Example.fromlist([text,label],fields=fields)
    examples.append(example)
print(type(examples[0]))
print(examples[0].comment)
print(examples[0].label)
# 4. Build the vocabulary
new_corpus = [example.comment for example in examples]
TEXT.build_vocab(new_corpus)
print(TEXT.process(new_corpus))
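
To connect this back to step 3, the Example list can be wrapped in a Dataset and then batched. A sketch of my own (batch_size and sort_key are my choices), not part of the original blog code:

from torchtext.data import Dataset, BucketIterator  # torchtext.legacy.data on torchtext >= 0.9

dataset = Dataset(examples, fields)
iterator = BucketIterator(dataset, batch_size=2,
                          sort_key=lambda ex: len(ex.comment), shuffle=True)
for batch in iterator:
    print(batch.comment)  # LongTensor of token ids, padded to fix_length=10
    print(batch.label)    # LongTensor of raw labels (use_vocab=False)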


The rough workflow:
1. Read the file

# Map each CSV column to a Field; None means the column is dropped
# (TWEET and LABEL are the Fields defined in step 2 below)
fields = [('score', None), ('id',None), ('date',None), ('query',None),
          ('name',None),('tweet',TWEET), ('category',None), ('label',LABEL)]

# Read the data
twitterDataset = data.TabularDataset(
    path = 'training-processed.csv',
    format = 'CSV',
    fields = fields,
    skip_header = False
)

# Split into train, test, val (stratified by the label field)
train, test, val = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1], stratified=True, strata_field='label')

2. Build the vocabulary and define tokenization

LABEL = data.LabelField()       # the label
TWEET = data.Field(lower=True)  # the tweet text

# Build the vocabulary (keep the max_size most frequent tokens)
vocab_size = 20000
TWEET.build_vocab(train, max_size=vocab_size)
LABEL.build_vocab(train)
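A quick sanity check I find useful here (my own addition, not from the post): the built vocab exposes itos/stoi and the raw frequency counts, and its final size is max_size plus the special tokens:

print(len(TWEET.vocab))                  # up to 20000 + 2 specials (<unk>, <pad>)
print(TWEET.vocab.itos[:5])              # index -> token; specials first, then by frequency
print(TWEET.vocab.freqs.most_common(5))  # raw token counts from the training set
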

3. OK, I can't describe the rest clearly yet; I'll leave it here and come back to it later.
In short, torchtext is a very handy package for text preprocessing.

Reference 1
Reference 2

torchtext.data.Field

This kept failing for me and I still haven't fixed it, so I can't run the code locally; I can only follow the tutorials to see how it is used.
See the introduction to the torchtext.data.Field parameters.
Later I found that the problem doesn't occur in Kaggle notebooks: the code runs there, on both CPU and GPU!
It is probably a version mismatch between my PyTorch and torchtext,
since the GitHub repo notes: this repository only works with torchtext 0.9 or later, which requires PyTorch 1.8 or later.

nn.EmbeddingBag

It turns all the words of a sentence into word vectors and, in the same step, reduces them with a mean (or a sum, or a max).
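
A minimal sketch of my own to show the idea (all sizes and indices are made up):

import torch
import torch.nn as nn

# vocabulary of 10 words, 3-dimensional vectors, reduced with a mean
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')

# two "sentences" packed into one flat tensor: [1, 2, 4] and [4, 3]
tokens = torch.tensor([1, 2, 4, 4, 3])
offsets = torch.tensor([0, 3])  # start position of each sentence

out = bag(tokens, offsets)
print(out.shape)  # torch.Size([2, 3]) - one averaged vector per sentence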

First look at spacy

In the GitHub German-to-English translation example, spacy is used to define the tokenizer functions.
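
The tokenizer functions in that example look roughly like the sketch below (the model names de_core_news_sm and en_core_web_sm are my assumption and must be downloaded first, e.g. python -m spacy download en_core_web_sm):

import spacy
from torchtext.data import Field  # torchtext.legacy.data on torchtext >= 0.9

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    # German text -> list of token strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # English text -> list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)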

Official site


Similar tools include CNTK and CoreNLP.

Reference

Code

Syntax trees
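
As a taste of the syntax trees mentioned above, a small sketch of my own using spacy's dependency parse:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model; download it first
doc = nlp("The quick brown fox jumps over the lazy dog")

# every token points to its syntactic head via a labeled dependency arc
for token in doc:
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")

# noun chunks fall straight out of the parse tree
print([chunk.text for chunk in doc.noun_chunks])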
