In today's era of big data, text classification is widely used in many areas, such as sentiment analysis, spam filtering, and topic classification. To handle these tasks effectively, we usually need to build a capable text classification model. In this post, we will use Python and the PyTorch library to build a simple text classification model and walk through its implementation.
Before getting started, make sure you have Python installed along with PyTorch, torchtext, and the spaCy English model used for tokenization:

pip install torch torchtext spacy
python -m spacy download en_core_web_sm
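After installing, you can quickly confirm that the libraries import correctly (a minimal sanity check; the printed versions will vary):

import torch
import torchtext
print(torch.__version__, torchtext.__version__)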
import torch
from torchtext.legacy import data
from torchtext.vocab import GloVe

# Define the fields: how raw text and labels are processed
TEXT = data.Field(tokenize='spacy', tokenizer_language='en_core_web_sm')
LABEL = data.LabelField(dtype=torch.long)  # integer class indices, as required by NLLLoss

# Load and split the dataset (CSV files with 'text' and 'label' columns)
train_data, valid_data, test_data = data.TabularDataset.splits(
    path='.', train='train.csv', validation='valid.csv', test='test.csv',
    format='csv', skip_header=True,
    fields=[('text', TEXT), ('label', LABEL)])

# Build the vocabularies; initialize word vectors from pre-trained GloVe embeddings
TEXT.build_vocab(train_data, vectors=GloVe(name='6B', dim=100, cache='path/to/glove/directory'))
LABEL.build_vocab(train_data)
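With the vocabularies built, it is worth a quick sanity check before moving on. A minimal sketch, assuming the fields and datasets defined above:

print(len(TEXT.vocab))             # vocabulary size (includes <unk> and <pad>)
print(TEXT.vocab.itos[:10])        # the ten most frequent tokens
print(TEXT.vocab.vectors.shape)    # [vocab_size, 100] matrix of GloVe vectors
print(LABEL.vocab.stoi)            # mapping from label strings to class indices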
import torch.nn as nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, text):
        # text: [seq_len, batch_size]
        embedded = self.embedding(text)       # [seq_len, batch_size, embedding_dim]
        output, hidden = self.rnn(embedded)   # hidden: [1, batch_size, hidden_dim]
        # Use the final hidden state as the representation of the whole sequence
        logits = self.fc(hidden.squeeze(0))   # [batch_size, output_dim]
        return self.softmax(logits)           # log-probabilities over the classes
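Before wiring the model into a training loop, a quick forward pass with a dummy batch confirms that the shapes line up; the sizes below are hypothetical. Copying the pre-trained GloVe vectors into the embedding layer is optional but usually helps:

demo_model = TextClassificationModel(vocab_size=len(TEXT.vocab), embedding_dim=100,
                                     hidden_dim=200, output_dim=2)
# Initialize the embedding layer with the GloVe vectors built earlier (optional)
demo_model.embedding.weight.data.copy_(TEXT.vocab.vectors)
dummy = torch.randint(0, len(TEXT.vocab), (50, 32))   # fake batch: seq_len=50, batch_size=32
print(demo_model(dummy).shape)                        # expected: torch.Size([32, 2])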
After building the model, we need to train and evaluate it. Here is a simple training and evaluation procedure:
# Hyperparameters
embedding_dim = 100
hidden_dim = 200
output_dim = 2   # assume we have two classes
lr = 0.01
epochs = 10
batch_size = 64

# Wrap the datasets in iterators that yield batches of tensors
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=batch_size,
    sort_key=lambda x: len(x.text))

# Instantiate the model, loss function, and optimizer
model = TextClassificationModel(len(TEXT.vocab), embedding_dim, hidden_dim, output_dim)
criterion = nn.NLLLoss()                                  # negative log-likelihood loss
optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # Adam optimizer with lr=0.01

# Train the model
model.train()
for epoch in range(epochs):
    for batch in train_iter:
        optimizer.zero_grad()                  # reset gradients
        output = model(batch.text)             # forward pass: log-probabilities
        loss = criterion(output, batch.label)  # compute the loss
        loss.backward()                        # backward pass: compute gradients
        optimizer.step()                       # update parameters
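In practice it also helps to watch the loss on the validation split as training progresses, so overfitting can be spotted early. A small helper, sketched under the assumption that valid_iter from above is available:

def evaluate_loss(model, iterator, criterion):
    """Average loss over an iterator, with gradients disabled (hypothetical helper)."""
    model.eval()
    total_loss = 0.0
    with torch.no_grad():
        for batch in iterator:
            output = model(batch.text)
            total_loss += criterion(output, batch.label).item()
    model.train()
    return total_loss / len(iterator)

# For example, call it at the end of each epoch:
# print(f'Epoch {epoch}: validation loss = {evaluate_loss(model, valid_iter, criterion):.4f}')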
Once training is finished, we can evaluate the model's performance on the test set:
model.eval()   # evaluation mode (layers such as dropout behave deterministically)
correct = 0
total = 0
with torch.no_grad():   # no gradients are needed during evaluation
    for batch in test_iter:
        output = model(batch.text)
        _, predicted = torch.max(output, 1)                  # most likely class index
        total += batch.label.size(0)                         # number of samples in the batch
        correct += (predicted == batch.label).sum().item()   # count correct predictions
acc = 100 * correct / total   # accuracy as a percentage
print(f'Accuracy: {acc}%')
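Finally, the trained model can be used to classify a single new sentence. The helper below is a hypothetical sketch; it assumes the spaCy tokenizer and the TEXT field built earlier:

import spacy
nlp = spacy.load('en_core_web_sm')

def predict_class(model, sentence):
    """Classify one raw sentence and return the predicted class index (hypothetical helper)."""
    model.eval()
    tokens = [tok.text for tok in nlp(sentence)]
    indices = [TEXT.vocab.stoi[t] for t in tokens]    # map tokens to vocabulary indices
    tensor = torch.LongTensor(indices).unsqueeze(1)   # shape [seq_len, batch=1]
    with torch.no_grad():
        output = model(tensor)
    return output.argmax(dim=1).item()

print(predict_class(model, "This movie was great"))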