
NLP Data Preprocessing (torchtext and spacy)

A problem came up when installing torchtext; pip reported:

ERROR: After October 2020 you may experience errors when installing or updating packages. This is because pip will change the way that it resolves dependency conflicts.

We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.

torchvision 0.4.2 requires torch==1.3.1, but you'll have torch 1.8.0 which is incompatible.
Collecting dataclasses; python_version < "3.7"
Using cached dataclasses-0.8-py3-none-any.whl (19 kB)
Installing collected packages: dataclasses, torch, torchtext
Attempting uninstall: torch
Found existing installation: torch 1.3.1
Can't uninstall 'torch'. No files were found to uninstall.
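
The log pins down the conflict: torchvision 0.4.2 requires torch==1.3.1, while torchtext pulled in torch 1.8.0. One plausible way out (my own suggestion based on the version note from GitHub quoted later in this post, not a fix given in the original) is to install a mutually compatible set of versions:

pip install torch==1.8.0 torchtext==0.9.0
# torchvision must be upgraded as well so it stops pinning torch==1.3.1;
# torchvision 0.9.0 is the release paired with torch 1.8.0
pip install torchvision==0.9.0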

Initialization functions: torch.nn.init

# -*- coding: utf-8 -*-
"""
Created on 2019

@author: fancp
"""

import torch 
import torch.nn as nn

w = torch.empty(3,5)

#1. Uniform distribution - U(a, b)
#torch.nn.init.uniform_(tensor, a=0.0, b=1.0)
print(nn.init.uniform_(w))
# =============================================================================
# tensor([[0.9160, 0.1832, 0.5278, 0.5480, 0.6754],
#         [0.9509, 0.8325, 0.9149, 0.8192, 0.9950],
#         [0.4847, 0.4148, 0.8161, 0.0948, 0.3787]])
# =============================================================================

#2. Normal distribution - N(mean, std)
#torch.nn.init.normal_(tensor, mean=0.0, std=1.0)
print(nn.init.normal_(w))
# =============================================================================
# tensor([[ 0.4388,  0.3083, -0.6803, -1.1476, -0.6084],
#         [ 0.5148, -0.2876, -1.2222,  0.6990, -0.1595],
#         [-2.0834, -1.6288,  0.5057, -0.5754,  0.3052]])
# =============================================================================

#3. Constant - fixed value val
#torch.nn.init.constant_(tensor, val)
print(nn.init.constant_(w, 0.3))
# =============================================================================
# tensor([[0.3000, 0.3000, 0.3000, 0.3000, 0.3000],
#         [0.3000, 0.3000, 0.3000, 0.3000, 0.3000],
#         [0.3000, 0.3000, 0.3000, 0.3000, 0.3000]])
# =============================================================================

#4. All ones
#torch.nn.init.ones_(tensor)
print(nn.init.ones_(w))
# =============================================================================
# tensor([[1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.],
#         [1., 1., 1., 1., 1.]])
# =============================================================================

#5. All zeros
#torch.nn.init.zeros_(tensor)
print(nn.init.zeros_(w))
# =============================================================================
# tensor([[0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.],
#         [0., 0., 0., 0., 0.]])
# =============================================================================

#6. Ones on the diagonal, zeros elsewhere
#torch.nn.init.eye_(tensor)
print(nn.init.eye_(w))
# =============================================================================
# tensor([[1., 0., 0., 0., 0.],
#         [0., 1., 0., 0., 0.],
#         [0., 0., 1., 0., 0.]])
# =============================================================================

#7. Xavier uniform initialization
#torch.nn.init.xavier_uniform_(tensor, gain=1.0)
#From - Understanding the difficulty of training deep feedforward neural networks - Glorot & Bengio 2010
print(nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu')))
# =============================================================================
# tensor([[-0.1270,  0.3963,  0.9531, -0.2949,  0.8294],
#         [-0.9759, -0.6335,  0.9299, -1.0988, -0.1496],
#         [-0.7224,  0.2181, -1.1219,  0.8629, -0.8825]])
# =============================================================================

#8. Xavier normal initialization
#torch.nn.init.xavier_normal_(tensor, gain=1.0)
print(nn.init.xavier_normal_(w))
# =============================================================================
# tensor([[ 1.0463,  0.1275, -0.3752,  0.1858,  1.1008],
#         [-0.5560,  0.2837,  0.1000, -0.5835,  0.7886],
#         [-0.2417,  0.1763, -0.7495,  0.4677, -0.1185]])
# =============================================================================

#9. Kaiming uniform initialization
#torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
#From - Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification - Kaiming He 2015
print(nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu'))
# =============================================================================
# tensor([[-0.7712,  0.9344,  0.8304,  0.2367,  0.0478],
#         [-0.6139, -0.3916, -0.0835,  0.5975,  0.1717],
#         [ 0.3197, -0.9825, -0.5380, -1.0033, -0.3701]])
# =============================================================================

#10. Kaiming normal initialization
#torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
print(nn.init.kaiming_normal_(w, mode='fan_out', nonlinearity='relu'))
# =============================================================================
# tensor([[-0.0210,  0.5532, -0.8647,  0.9813,  0.0466],
#         [ 0.7713, -1.0418,  0.7264,  0.5547,  0.7403],
#         [-0.8471, -1.7371,  1.3333,  0.0395,  1.0787]])
# =============================================================================

#11. (Semi-)orthogonal matrix
#torch.nn.init.orthogonal_(tensor, gain=1)
#From - Exact solutions to the nonlinear dynamics of learning in deep linear neural networks - Saxe 2013
print(nn.init.orthogonal_(w))
# =============================================================================
# tensor([[-0.0346, -0.7607, -0.0428,  0.4771,  0.4366],
#         [-0.0412, -0.0836,  0.9847,  0.0703, -0.1293],
#         [-0.6639,  0.4551,  0.0731,  0.1674,  0.5646]])
# =============================================================================

#12. Sparse matrix
#torch.nn.init.sparse_(tensor, sparsity, std=0.01)
#From - Deep learning via Hessian-free optimization - Martens 2010
print(nn.init.sparse_(w, sparsity=0.1))
# =============================================================================
# tensor([[ 0.0000,  0.0000, -0.0077,  0.0000, -0.0046],
#         [ 0.0152,  0.0030,  0.0000, -0.0029,  0.0005],
#         [ 0.0199,  0.0132, -0.0088,  0.0060,  0.0000]])
# =============================================================================
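
In real models these functions are applied to a module's parameters rather than a loose tensor. A minimal sketch of my own (not part of the original script), reusing the nn import above together with Module.apply:

def init_weights(m):
    # Xavier-uniform weights and zero biases for every Linear layer
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
model.apply(init_weights)  # .apply() visits every submodule recursively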


First look at torchtext

I went through three classic examples:

A. The German-to-English translation tutorial on GitHub
B. A blog post: 【PyTorch】【torchtext (2)】Field explained in detail
C. A text-generation example on Bilibili

From B:
1. Field: think of it as the power tool of the pipeline: it can add special tokens to the data, pad everything to the same length, and define how the text is split into tokens.

TEXT = Field(sequential=True, lower=True, fix_length=10,tokenize=str.split,batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)

2. Hand this tool and the data together to Example, and the data comes out processed:

for text,label in zip(corpus,labels):
    example = Example.fromlist([text,label],fields=fields)
    examples.append(example)

3. Hand the processed data to BucketIterator.splits, which batches examples of similar length together to minimize padding:

(This snippet is not from the same project as the two steps above: it comes from the German-English translation example, while the steps above follow the blog walkthrough.)
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

Full source:

from torchtext.data import Field, Example  # torchtext.legacy.data on torchtext >= 0.9

# 1. The data
corpus = ["D'aww! He matches this background colour",
         "Yo bitch Ja Rule is more succesful then",
         "If you have a look back at the source"]
labels = [0,1,0]
# 2. Define the different Fields
TEXT = Field(sequential=True, lower=True, fix_length=10,tokenize=str.split,batch_first=True)
LABEL = Field(sequential=False, use_vocab=False)
fields = [("comment", TEXT),("label",LABEL)]
# 3. Convert the data into a list of Example objects
examples = []
for text,label in zip(corpus,labels):
    example = Example.fromlist([text,label],fields=fields)
    examples.append(example)
print(type(examples[0]))
print(examples[0].comment)
print(examples[0].label)
# 4. Build the vocabulary
new_corpus = [example.comment for example in examples]
TEXT.build_vocab(new_corpus)
print(TEXT.process(new_corpus))
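
To connect this back to step 3, the Example list can be wrapped in a Dataset and then batched. A sketch of my own (batch_size and sort_key are my choices), not part of the original blog code:

from torchtext.data import Dataset, BucketIterator  # torchtext.legacy.data on torchtext >= 0.9

dataset = Dataset(examples, fields)
iterator = BucketIterator(dataset, batch_size=2,
                          sort_key=lambda ex: len(ex.comment), shuffle=True)
for batch in iterator:
    print(batch.comment)  # LongTensor of token ids, padded to fix_length=10
    print(batch.label)    # LongTensor of raw labels (use_vocab=False)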


The rough workflow:
1. Read the file

# Map each CSV column to a Field; None means the column is dropped
# (TWEET and LABEL are the Fields defined in step 2 below)
fields = [('score', None), ('id',None), ('date',None), ('query',None),
          ('name',None),('tweet',TWEET), ('category',None), ('label',LABEL)]

# Read the data
twitterDataset = data.TabularDataset(
    path = 'training-processed.csv',
    format = 'CSV',
    fields = fields,
    skip_header = False
)

# Split into train, test, val (stratified by the label field)
train, test, val = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1], stratified=True, strata_field='label')

2. Build the vocabulary and define tokenization

LABEL = data.LabelField()       # the label
TWEET = data.Field(lower=True)  # the tweet text

# Build the vocabulary (keep the max_size most frequent tokens)
vocab_size = 20000
TWEET.build_vocab(train, max_size=vocab_size)
LABEL.build_vocab(train)
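A quick sanity check I find useful here (my own addition, not from the post): the built vocab exposes itos/stoi and the raw frequency counts, and its final size is max_size plus the special tokens:

print(len(TWEET.vocab))                  # up to 20000 + 2 specials (<unk>, <pad>)
print(TWEET.vocab.itos[:5])              # index -> token; specials first, then by frequency
print(TWEET.vocab.freqs.most_common(5))  # raw token counts from the training set
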

3. OK, I can't describe the rest clearly yet; I'll leave it here and come back to it later.
In short, torchtext is a very handy package for text preprocessing.

Reference 1
Reference 2

torchtext.data.Field

This kept failing for me and I still haven't fixed it, so I can't run the code locally; I can only follow the tutorials to see how it is used.
See the introduction to the torchtext.data.Field parameters.
Later I found that the problem doesn't occur in Kaggle notebooks: the code runs there, on both CPU and GPU!
It is probably a version mismatch between my PyTorch and torchtext,
since the GitHub repo notes: this repository only works with torchtext 0.9 or later, which requires PyTorch 1.8 or later.

nn.EmbeddingBag

It turns all the words of a sentence into word vectors and, in the same step, reduces them with a mean (or a sum, or a max).
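
A minimal sketch of my own to show the idea (all sizes and indices are made up):

import torch
import torch.nn as nn

# vocabulary of 10 words, 3-dimensional vectors, reduced with a mean
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='mean')

# two "sentences" packed into one flat tensor: [1, 2, 4] and [4, 3]
tokens = torch.tensor([1, 2, 4, 4, 3])
offsets = torch.tensor([0, 3])  # start position of each sentence

out = bag(tokens, offsets)
print(out.shape)  # torch.Size([2, 3]) - one averaged vector per sentence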

First look at spacy

In the GitHub German-to-English translation example, spacy is used to define the tokenizer functions.
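
The tokenizer functions in that example look roughly like the sketch below (the model names de_core_news_sm and en_core_web_sm are my assumption and must be downloaded first, e.g. python -m spacy download en_core_web_sm):

import spacy
from torchtext.data import Field  # torchtext.legacy.data on torchtext >= 0.9

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    # German text -> list of token strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # English text -> list of token strings
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)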

Official site


Similar tools include CNTK and CoreNLP.

Reference

Code

Syntax trees
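
As a taste of the syntax trees mentioned above, a small sketch of my own using spacy's dependency parse:

import spacy

nlp = spacy.load('en_core_web_sm')  # assumed model; download it first
doc = nlp("The quick brown fox jumps over the lazy dog")

# every token points to its syntactic head via a labeled dependency arc
for token in doc:
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")

# noun chunks fall straight out of the parse tree
print([chunk.text for chunk in doc.noun_chunks])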
