In this tutorial, we show how to use the torchtext library to build a dataset for text classification analysis. Users will have the flexibility to:

- Access the raw data as an iterator
- Build a data processing pipeline to convert the raw text strings into torch.Tensor objects that can be used to train the model
- Shuffle and iterate over the data with torch.utils.data.DataLoader
The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.
# Running the code from the original tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')
The code above raises the error ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument')). So I downloaded the AG_NEWS dataset manually instead; download link:
https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501
from torchtext.utils import unicode_csv_reader
import io

def read_iter(path):
    # Read the AG_NEWS CSV and yield (label, text) tuples, matching
    # the format of torchtext's raw dataset iterators
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            # Column 0 is the label; the remaining columns (title and
            # description) are joined into a single text string
            yield int(row[0]), ' '.join(row[1:])
train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
"Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
We will use the most basic components of the torchtext library, including the vocab, word vectors, and tokenizer, to perform basic preprocessing on the raw text strings.
Here is an example of NLP data preprocessing with a tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by passing arguments to the constructor of the Vocab class; for example, min_freq sets the minimum frequency a token must have to be included.
from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab

# get_tokenizer creates a tokenizer that splits text according to the
# rules of the chosen tokenizer function; supported options include
# 'basic_english', 'spacy', 'moses', 'toktok', 'revtok', 'subword', etc.
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # Feed each line of the corpus to the tokenizer and accumulate token frequencies
    counter.update(tokenizer(line))
# Create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# A Vocab object has three attributes: freqs, stoi, and itos; here we show itos
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
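The other two attributes can be inspected the same way. A quick illustration (the stoi index matches the itos listing above; the exact freqs count depends on the corpus, so it is not shown here):

# stoi maps token -> index, the inverse of itos
vocab.stoi['the']
3
# freqs is the underlying Counter of raw token frequencies
vocab.freqs['the']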
# Convert tokens into integers (each token is replaced by its unique index)
[vocab[token] for token in ['here', 'is', 'an', 'example']]
[476, 22, 31, 5298]
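To see the effect of the min_freq argument mentioned earlier, the vocabulary can be rebuilt with a higher threshold. This is a hypothetical illustration (vocab_small is not part of the original code):

# Keep only tokens that occur at least 10 times; rarer tokens are
# dropped from the vocabulary and map to '<unk>' on lookup
vocab_small = Vocab(counter, min_freq=10)
len(vocab_small) < len(vocab)  # True: rare tokens were excluded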
Next, prepare the text processing pipeline with the tokenizer and vocabulary. The text and label pipelines will be used to process the raw data strings.
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2
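Finally, as promised in the introduction, the two pipelines can be plugged into torch.utils.data.DataLoader to shuffle and batch the data. The sketch below is a minimal example, not from the original tutorial: collate_batch and the padding strategy are my own assumptions (the pad index comes from the '<pad>' special visible in vocab.itos above).

import torch
from torch.utils.data import DataLoader

def collate_batch(batch):
    # Apply the label/text pipelines to every (label, text) tuple
    label_list, text_list = [], []
    for (label, text) in batch:
        label_list.append(label_pipeline(label))
        text_list.append(torch.tensor(text_pipeline(text), dtype=torch.int64))
    # Pad variable-length sequences so they stack into one tensor;
    # the padding index is the '<pad>' special from the vocabulary
    padded = torch.nn.utils.rnn.pad_sequence(
        text_list, batch_first=True, padding_value=vocab['<pad>'])
    return torch.tensor(label_list, dtype=torch.int64), padded

# read_iter returns a one-shot generator, so materialize it into a list
# to get a map-style dataset that DataLoader can shuffle
train_list = list(read_iter(train_path))
dataloader = DataLoader(train_list, batch_size=8, shuffle=True,
                        collate_fn=collate_batch)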