
PyTorch Notes 6: AG_NEWS download error

Text Classification with the torchtext library

In this tutorial, we show how to use the torchtext library to build the dataset for text classification analysis. Users will have the flexibility to:

  • Access the raw data as an iterator
  • Build a data processing pipeline to convert the raw text strings into torch.Tensor that can be used to train the model
  • Shuffle and iterate over the data with torch.utils.data.DataLoader
1. Access to the raw dataset iterators

The torchtext library provides a few raw dataset iterators, which yield the raw text strings. For example, the AG_NEWS dataset iterators yield the raw data as a tuple of label and text.

# running the code as presented in the tutorial raises a ConnectionError
import torch
from torchtext.datasets import AG_NEWS
train_iter = AG_NEWS(split='train')

The code above fails with ConnectionError: ('Connection aborted.', OSError(22, 'Invalid argument'))

I therefore downloaded the AG_NEWS dataset from the web instead; download link:

https://download.csdn.net/download/hanfeixue2001/16261579?spm=1001.2014.3001.5501

from torchtext.utils import unicode_csv_reader
import io

def read_iter(path):
    # mimic the raw AG_NEWS iterator: yield (label, text) tuples,
    # where column 0 holds the label and the remaining columns form the text
    with io.open(path, encoding='utf-8') as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            yield int(row[0]), ' '.join(row[1:])

train_path = './AG_NEWS/train.csv'
test_path = './AG_NEWS/test.csv'
train_iter = read_iter(train_path)
next(train_iter)
(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")
next(train_iter)
(3,
 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.')
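Since the real CSV files may not be present on disk, the row-handling logic of `read_iter` can be checked against a small in-memory sample. The rows below are hypothetical, and `csv.reader` stands in for `unicode_csv_reader` (which behaves the same way on UTF-8 text):

```python
import csv
import io
from collections import Counter

# hypothetical rows standing in for AG_NEWS/train.csv, so the parsing
# logic can be checked without downloading the real file
sample_csv = (
    '3,"Wall St. Bears Claw Back Into the Black (Reuters)"\n'
    '4,"Oil prices slip on profit taking"\n'
)

def read_rows(f):
    # same row handling as read_iter above: column 0 is the label,
    # the remaining columns are joined into the text
    for row in csv.reader(f):
        yield int(row[0]), ' '.join(row[1:])

label_counts = Counter(label for label, _ in read_rows(io.StringIO(sample_csv)))
print(label_counts)  # Counter({3: 1, 4: 1})
```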
2. Prepare data processing pipelines

We will use the most basic components of the torchtext library, including the vocab, word vectors, and tokenizer, to perform basic preprocessing on the raw text strings.

Here is an example of NLP data preprocessing with a tokenizer and vocabulary. The first step is to build the vocabulary from the raw training dataset. Users can customize the vocab by passing arguments to the constructor of the Vocab class; for example, min_freq sets the minimum number of occurrences a token needs in order to be included.

from torchtext.data.utils import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
# get_tokenizer creates a tokenizer that splits text according to the chosen scheme;
# supported schemes include 'basic_english', 'spacy', 'moses', 'toktok', 'revtok' and 'subword'
tokenizer = get_tokenizer('basic_english')
train_iter = read_iter(train_path)
counter = Counter()
for label, line in train_iter:
    # feed each line of the corpus to the tokenizer
    counter.update(tokenizer(line))
# create a Vocab object from a collections.Counter
vocab = Vocab(counter, min_freq=1)
# vocab has three attributes -- freqs, stoi and itos; here we show one of them
vocab.itos[:5]
['<unk>', '<pad>', '.', 'the', ',']
# convert tokens into integers (numericalization: each token is replaced by its unique index)
[vocab[token] for token in ['here', 'is', 'an', 'example']]

[476, 22, 31, 5298]
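The lookup above falls back to the `<unk>` token (index 0) for words that never appeared in the training corpus. A minimal pure-Python sketch of this behaviour, using a hypothetical toy vocabulary rather than the real AG_NEWS one:

```python
# toy vocabulary mimicking torchtext's Vocab: itos maps index -> token,
# stoi maps token -> index, and <unk> sits at index 0
itos = ['<unk>', '<pad>', '.', 'the', ',', 'here', 'is', 'an', 'example']
stoi = {token: idx for idx, token in enumerate(itos)}

def numericalize(tokens):
    # unknown tokens map to index 0, the <unk> entry
    return [stoi.get(token, 0) for token in tokens]

print(numericalize(['here', 'is', 'an', 'example']))  # [5, 6, 7, 8]
print(numericalize(['here', 'is', 'qwerty']))         # [5, 6, 0]
```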

Prepare the text processing pipeline with the tokenizer and vocabulary. The text pipeline and the label pipeline will be used to process the raw data strings:

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
label_pipeline = lambda x: int(x) - 1
text_pipeline('here is an example')
[476, 22, 31, 5298]
label_pipeline('3')
2
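These two pipelines are typically handed to torch.utils.data.DataLoader through a collate_fn in the next step of the torchtext tutorial. Below is a pure-Python sketch of such a collate function (plain lists stand in for torch.Tensor, and a hypothetical toy vocabulary replaces the real one), showing how a batch is flattened into a single list of token ids plus per-example offsets, the format nn.EmbeddingBag expects:

```python
# sketch of a collate function in the style of the torchtext tutorial:
# it flattens the token ids of a whole batch into one list and records,
# via `offsets`, where each example starts
def collate_batch(batch, text_pipeline, label_pipeline):
    labels, token_ids, offsets = [], [], [0]
    for raw_label, raw_text in batch:
        labels.append(label_pipeline(raw_label))
        ids = text_pipeline(raw_text)
        token_ids.extend(ids)
        offsets.append(offsets[-1] + len(ids))
    # drop the final cumulative length: offsets[i] marks where example i starts
    return labels, token_ids, offsets[:-1]

# toy pipelines standing in for the vocab-backed ones defined above
toy_vocab = {'hello': 1, 'world': 2, 'again': 3}
text_pipeline = lambda x: [toy_vocab.get(tok, 0) for tok in x.split()]
label_pipeline = lambda x: int(x) - 1

batch = [('3', 'hello world'), ('1', 'hello again')]
print(collate_batch(batch, text_pipeline, label_pipeline))
# ([2, 0], [1, 2, 1, 3], [0, 2])
```

In the real tutorial the three returned lists become tensors on the target device; the offsets trick avoids padding every example to the same length.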