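Few-shot NLP: using transfer learning for 5-shot classification is an effective approach, particularly for text classification when data is scarce. In this setting only a handful of labeled examples (5 per class) are available for training, and transfer learning lets the model draw on knowledge learned from other, related data sources to improve its performance.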
AG_NEWS: a news corpus covering 4 broad categories of news: World, Sports, Business, and Sci/Tech.
AG_NEWS contains 120,000 training samples (train.csv) and 7,600 test samples (test.csv). Each class has 30,000 training samples and 1,900 test samples.
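BERT is a pretrained language model proposed by Google in 2018. Traditional language models predict the next word in one direction only, either left to right or right to left. BERT instead uses the Transformer encoder and draws on context from both directions during pretraining. Pretraining combines two objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP). For MLM, BERT randomly masks some of the tokens in the input sentence and predicts the masked tokens from the surrounding context. For NSP, BERT takes a pair of sentences as input and judges whether the second one actually follows the first. Once pretraining is complete, BERT can be fine-tuned on downstream tasks such as text classification, named entity recognition, and natural language inference. Fine-tuning is how the language representations BERT has learned are applied to a wide range of natural language processing tasks.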
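To make the MLM idea concrete, here is a minimal sketch of random token masking in plain Python. It is a simplification, not BERT's actual preprocessing: the function name `mask_tokens`, the whitespace tokenization, and the example sentence are all assumptions made for illustration. The original BERT recipe selects about 15% of tokens and only replaces most of the selected ones with `[MASK]`, while this sketch always substitutes `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Each position is masked independently with probability mask_prob.
        if rng.random() < mask_prob:
            targets[i] = tok          # what the model would have to predict
            masked[i] = MASK_TOKEN    # what the model would actually see
    return masked, targets

# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "the stock market closed higher on friday".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

During pretraining the model sees `masked` as input and is scored only on the positions stored in `targets`, which is why it has to use context on both sides of each gap.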
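5-shot classification, as the name suggests, trains on 5 samples from each class and tests on the remaining samples. Note that the script below draws its 5 training samples per class from train.csv and, for a quick check, evaluates on 5 samples per class drawn from test.csv rather than on the full test set.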
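The split described above can be sketched without any third-party libraries. This is only an illustration under assumptions: `k_shot_split` is a hypothetical helper, and the toy data is made up rather than taken from AG_NEWS. It picks k examples per class without replacement and returns everything else as the held-out set.

```python
import random
from collections import defaultdict

def k_shot_split(examples, k=5, seed=0):
    """Split (text, label) pairs into a k-per-class support set and the rest."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, (_, label) in enumerate(examples):
        by_label[label].append(idx)

    support_idx = set()
    for label, indices in by_label.items():
        if len(indices) < k:
            raise ValueError(f"class {label!r} has fewer than {k} examples")
        # random.sample draws without replacement, so no example repeats.
        support_idx.update(rng.sample(indices, k))

    support = [ex for i, ex in enumerate(examples) if i in support_idx]
    rest = [ex for i, ex in enumerate(examples) if i not in support_idx]
    return support, rest

# Toy data: 4 classes with 8 made-up examples each (not real AG_NEWS rows).
toy = [(f"doc {c}-{i}", c) for c in range(4) for i in range(8)]
support, rest = k_shot_split(toy, k=5)
print(len(support), len(rest))  # 20 12
```

Fixing the seed keeps the split reproducible, which matters here because results on only 5 examples per class can vary a lot from one draw to the next.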
import numpy as np
import pandas as pd
from simpletransformers.classification import ClassificationModel
from sklearn.metrics import classification_report

# Environment used in the original post:
# torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113


def train(train_file, test_file):
    # Read the train and test files
    train_df = pd.read_csv(train_file)
    test_df = pd.read_csv(test_file)

    # We use PyTorch and BERT, so no extra text preprocessing is done
    df = pd.DataFrame()
    df['text'] = train_df['Description']
    # Labels in the CSV run from 1 to N; map them to 0 to N-1
    df['label'] = train_df['Class Index'] - 1

    # 5 is the number of samples drawn from each class for training
    # (sampled without replacement so the same row is never picked twice)
    small_df = df.groupby('label').sample(n=5).reset_index(drop=True)

    # Configure the simple transformer for text classification;
    # select the BERT model you want to train
    model = ClassificationModel(
        'bert', 'bert-base-cased', num_labels=4,
        args={
            'reprocess_input_data': True,
            'overwrite_output_dir': True,
            'num_train_epochs': 80,
            'learning_rate': 5e-5,
            'train_batch_size': 20,
            'eval_batch_size': 20,
        },
        use_cuda=False,
    )

    # Train the model on the small dataset
    model.train_model(small_df)

    # Prepare the evaluation set the same way: 5 samples per class from the test file
    dt = pd.DataFrame()
    dt['text'] = test_df['Description']
    dt['label'] = test_df['Class Index'] - 1
    small_dt = dt.groupby('label').sample(n=5).reset_index(drop=True)

    # Evaluate the model
    result, model_outputs, wrong_predictions = model.eval_model(small_dt)

    # The predicted class is the index of the highest score in each output row
    predicted = [np.argmax(arr) for arr in model_outputs]
    true = small_dt['label'].tolist()
    print(classification_report(true, predicted,
                                target_names=['World', 'Sports', 'Business', 'Sci/Tech']))


if __name__ == "__main__":
    train_file = "data/train.csv"
    test_file = "data/test.csv"
    train(train_file, test_file)
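The evaluation step turns each row of raw scores into a predicted class by taking the position of the largest value, then compares that with the true label. Here is a small dependency-free sketch of that logic. The helper names `argmax` and `per_class_accuracy` and the score values are invented for illustration; they are not output from the model above.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def per_class_accuracy(true_labels, outputs, num_classes):
    """Fraction of correctly predicted examples for each class index."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for label, scores in zip(true_labels, outputs):
        total[label] += 1
        if argmax(scores) == label:
            correct[label] += 1
    # Classes with no examples get 0.0 rather than a division error.
    return [c / t if t else 0.0 for c, t in zip(correct, total)]

# Made-up scores for 4 examples over 4 classes (illustrative values only).
outputs = [
    [2.1, 0.3, -1.0, 0.2],
    [0.1, 1.7, 0.4, -0.5],
    [0.9, 0.2, 0.8, 0.1],
    [-0.3, 0.0, 0.2, 1.5],
]
true_labels = [0, 1, 2, 3]
print(per_class_accuracy(true_labels, outputs, num_classes=4))  # [1.0, 1.0, 0.0, 1.0]
```

In the toy data the third example, whose true class is 2, scores highest on class 0, so class 2 gets 0.0. This per-class figure corresponds to the recall column that `classification_report` prints for each class.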