(9-4-2) Pre-trained Models: FinBERT

Author: 我家小花儿 | 2024-04-03 08:04:23




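(13) Use the train_test_split function to split the dataset into a training set (X_train and y_train) and a validation set (X_val and y_val). The split is stratified on the label and holds out 15% of the samples for validation, so the model can be trained on one part and evaluated on the other. Note that X_train and X_val hold row indices of financial_data rather than the headline text itself. The code is as follows.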


X_train, X_val, y_train, y_val = train_test_split(financial_data.index.values,
                                                  financial_data.label.values,
                                                  test_size = 0.15,
                                                  random_state = 2022,
                                                  stratify = financial_data.label.values)

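(14) Add a new column named data_type to the financial_data DataFrame, set to 'train' for the rows indexed by X_train and 'val' for the rows indexed by X_val, to mark which split each sample belongs to. Then count how many samples of each sentiment label fall into each split. The code is as follows.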


financial_data.loc[X_train, 'data_type'] = 'train'
financial_data.loc[X_val, 'data_type'] = 'val'
financial_data.groupby(['sentiment', 'label', 'data_type']).count()

Running this outputs:


                               NewsHeadline
sentiment  label  data_type
negative   1      train                 513
                  val                    91
neutral    0      train                2447
                  val                   432
positive   2      train                1159
                  val                   204

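This output shows how each sentiment label is distributed across the training and validation sets. Because the split is stratified, every class keeps roughly the same 85/15 ratio between the two sets, which makes the training and validation results comparable.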
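As a quick sanity check on the stratification, the share of each class that ended up in the validation set can be recomputed from the counts printed above. This is only a sketch: the counts are copied by hand from the table, and val_fraction is a hypothetical helper, not part of the original code.

```python
# Per-class counts copied from the groupby output above: (train, val)
counts = {
    "negative": (513, 91),
    "neutral": (2447, 432),
    "positive": (1159, 204),
}


def val_fraction(train_count, val_count):
    """Share of a class's samples that landed in the validation split."""
    return val_count / (train_count + val_count)


fractions = {name: val_fraction(tr, va) for name, (tr, va) in counts.items()}
total_samples = sum(tr + va for tr, va in counts.values())

for name, frac in fractions.items():
    print(f"{name}: {frac:.4f}")
print("total samples:", total_samples)
```

Each fraction should sit very close to the requested test_size of 0.15, and the total should match the number of headlines in the dataset.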

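(15) Next, examine the distribution of headline lengths to decide the maximum length to use when encoding the data. First, use the BertTokenizer class from the Hugging Face Transformers library to load the tokenizer of the pre-trained financial-domain model "ProsusAI/finbert". The code is as follows.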

finbert_tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert",  do_lower_case=True)

Running this outputs:


Downloading: 100%|██████████|226k/226k [00:00<00:00, 807kB/s]
Downloading: 100%|██████████|112/112 [00:00<00:00, 4.33kB/s]
Downloading: 100%|██████████|252/252 [00:00<00:00, 10.1kB/s]
Downloading: 100%|██████████|758/758 [00:00<00:00, 11.3kB/s]

(16) Call the get_headlines_len function to compute the length of every news headline in the dataset and store the results in a list named headlines_sequence_lengths. The code is as follows.

headlines_sequence_lengths = get_headlines_len(financial_data)

Running this outputs:


Encoding in progress...
 
100%|██████████| 4846/4846 [00:04<00:00, 1136.56it/s]
 
End of Task.

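This list can be used to analyze the distribution of headline lengths and to choose the maximum sequence length for encoding, so that headlines are padded or truncated appropriately during training.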
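One way to turn such a list into a max_length setting is to compare the longest headline with a length that already covers most samples. The sketch below assumes a plain list of integers; summarize_lengths and the toy values are hypothetical and are not the article's get_headlines_len.

```python
import math


def summarize_lengths(lengths, coverage=0.99):
    """Return the maximum length and the smallest length that covers
    at least `coverage` of the samples (nearest-rank percentile)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(coverage * len(ordered)))
    return {"max": ordered[-1], "covering": ordered[rank - 1]}


# Made-up token counts, only to exercise the helper
toy_lengths = [12, 18, 25, 31, 40, 22, 19, 27, 35, 150]
summary = summarize_lengths(toy_lengths, coverage=0.9)
print(summary)
```

If the two numbers are far apart, a single outlier is driving the maximum, and a smaller max_length with truncation may be a reasonable trade-off.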

(17) Call the previously defined show_headline_distribution function to display the distribution of headline lengths. The code is as follows.

show_headline_distribution(headlines_sequence_lengths)

The result is shown in Figure 9-2. The histogram visualizes the distribution of headline lengths, making their range and spread easier to see.

Figure 9-2 Distribution of news headline lengths

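(18) Encode the news headlines of the training and validation sets and wrap them, together with their labels, into datasets for training and evaluating the financial sentiment classifier. The code is as follows.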


encoded_data_train = finbert_tokenizer.batch_encode_plus(
    financial_data[financial_data.data_type=='train'].NewsHeadline.values, 
    return_tensors='pt',
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=150 # the maximum length observed in the headlines
)
 
encoded_data_val = finbert_tokenizer.batch_encode_plus(
    financial_data[financial_data.data_type=='val'].NewsHeadline.values, 
    return_tensors='pt',
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=150 # the maximum length observed in the headlines
)
 
 
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(financial_data[financial_data.data_type=='train'].label.values)
 
input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
sentiments_val = torch.tensor(financial_data[financial_data.data_type=='val'].label.values)
 
 
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, sentiments_val)
 
len(sentiment_dict)

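Running this outputs the following. The warning appears because max_length is set without an explicit truncation=True, and the final 3 is the value of len(sentiment_dict), the number of sentiment classes: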


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
 
 
3
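To make the padding and truncation behaviour concrete, here is a simplified pure-Python sketch of what fixing every sequence to max_length means. pad_or_truncate is a hypothetical helper and the token ids are made up. It is not the tokenizer's implementation (for instance, the real tokenizer keeps the closing special token when it truncates). In newer releases of Transformers, the documented way to request this behaviour is padding='max_length' together with truncation=True, which also avoids the warning above.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                 # truncation: drop the overflow
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad           # right padding with pad_id
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut to max_length
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the article passes it to the model alongside input_ids.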

(19) Use the AutoModelForSequenceClassification class from the Hugging Face Transformers library to load a sequence classification model from the pre-trained "ProsusAI/finbert" checkpoint. The code is as follows.


model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert",
                                                          num_labels=len(sentiment_dict),
                                                          output_attentions=False,
                                                          output_hidden_states=False)

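The loaded model already contains the pre-trained architecture and weights, so it only needs fine-tuning to adapt it to this financial sentiment classification task. Its output includes the prediction scores for the sentiment classes.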

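(20) Create data loaders (DataLoader) for the training and validation sets so that data can be loaded in batches for the forward and backward passes. The training loader samples batches in random order, while the validation loader iterates sequentially. The code is as follows.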


batch_size = 5
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)
 
dataloader_validation = DataLoader(dataset_val,
                                   sampler=SequentialSampler(dataset_val),
                                   batch_size=batch_size)
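The number of batches each loader yields follows directly from the split sizes and the batch size. The sketch below assumes that a final partial batch is kept rather than dropped, and uses the per-class counts from the groupby output in step (14); num_batches is a hypothetical helper.

```python
import math


def num_batches(num_samples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


train_samples = 513 + 2447 + 1159   # training rows per class, from step (14)
val_samples = 91 + 432 + 204        # validation rows per class, from step (14)

train_batches = num_batches(train_samples, 5)
val_batches = num_batches(val_samples, 5)
print(train_batches, val_batches)
```

The training figure agrees with the 824 iterations per epoch shown by the progress bars in the training log.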

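(21) Set up the optimizer and the learning rate scheduler, and specify the total number of training epochs. With these settings the model is trained with the AdamW optimizer, and the scheduler adjusts the learning rate after every training step, decreasing it linearly over the len(dataloader_train)*epochs steps of the whole run. The code is as follows.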


optimizer = AdamW(model.parameters(),
                  lr=1e-5,
                  eps=1e-8)
epochs = 2
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
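The schedule itself is simple enough to write out by hand. The sketch below mirrors the linear warmup and decay rule as a plain function so the resulting learning rates can be inspected. linear_schedule_factor is a hypothetical name, and 824 steps per epoch is taken from the progress bars in the training log.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 824 * 2  # batches per epoch * epochs

lrs = {s: base_lr * linear_schedule_factor(s, 0, total_steps)
       for s in (0, 824, 1648)}
for step, lr in lrs.items():
    print(step, lr)
```

With num_warmup_steps=0 there is no ramp-up: training starts at the full learning rate, reaches half of it at the end of the first epoch, and ends at zero.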

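(22) Start training the financial sentiment classifier. Training runs for several epochs. In each epoch the model performs a forward pass, loss computation, and backward pass on every training batch and then updates its parameters. The training loss, validation loss, and F1 score are computed and printed to monitor performance, and the model parameters are saved at the end of every epoch for later use. The code is as follows.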


seed_val = 2022
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
 
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
 
for epoch in tqdm(range(1, epochs+1)): 
    model.train()  
    loss_train_total = 0
    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:
        model.zero_grad()    
        batch = tuple(b.to(device) for b in batch)     
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       
 
        outputs = model(**inputs)  
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()     
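        # Note: batch is a 3-tuple (input_ids, attention_mask, labels), so len(batch) is 3
        # and the value shown in the progress bar is loss/3, not a per-sample loss
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})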
    torch.save(model.state_dict(), f'finetuned_finBERT_epoch_{epoch}.model')
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Running this prints the training progress:


  0%|          | 0/2 [00:00<?, ?it/s]
Epoch 1:   0%|          | 0/824 [00:00<?, ?it/s]
Epoch 1:   0%|          | 0/824 [00:01<?, ?it/s, training_loss=1.028]
Epoch 1:   0%|          | 1/824 [00:01<13:58,  1.02s/it, training_loss=1.028]
Epoch 1:   0%|          | 1/824 [00:01<13:58,  1.02s/it, training_loss=0.840]
Epoch 1:   0%|          | 2/824 [00:01<06:44,  2.03it/s, training_loss=0.840]
Epoch 1:   0%|          | 2/824 [00:01<06:44,  2.03it/s, training_loss=0.459]
Epoch 1:   0%|          | 3/824 [00:01<04:24,  3.10it/s, training_loss=0.459]
Epoch 1:   0%|          | 3/824 [00:01<04:24,  3.10it/s, training_loss=0.382]
Epoch 1:   0%|          | 4/824 [00:01<03:18,  4.13it/s, training_loss=0.382]
Epoch 1:   0%|          | 4/824 [00:01<03:18,  4.13it/s, training_loss=0.300]
###### part of the output omitted
Epoch 1: 100%|██████████| 824/824 [01:39<00:00,  8.74it/s, training_loss=0.006]
  0%|          | 0/2 [01:40<?, ?it/s]
Epoch 1
Training loss: 0.4581567718019004
###### part of the output omitted
 50%|█████     | 1/2 [01:44<01:44, 104.72s/it]
Validation loss: 0.3778555450533606
F1 Score (Weighted): 0.8753212258177056
Epoch 2:   0%|          | 0/824 [00:00<?, ?it/s]
Epoch 2:   0%|          | 0/824 [00:00<?, ?it/s, training_loss=0.004]
Epoch 2:   0%|          | 1/824 [00:00<01:32,  8.87it/s, training_loss=0.004]
Epoch 2:   0%|          | 1/824 [00:00<01:32,  8.87it/s, training_loss=0.005]
###### part of the output omitted
Epoch 2: 100%|█████████▉| 821/824 [01:38<00:00,  8.36it/s, training_loss=0.299]
Epoch 2: 100%|█████████▉| 821/824 [01:38<00:00,  8.36it/s, training_loss=0.003]
Epoch 2: 100%|█████████▉| 822/824 [01:38<00:00,  8.41it/s, training_loss=0.003]
Epoch 2: 100%|█████████▉| 822/824 [01:38<00:00,  8.41it/s, training_loss=0.002]
Epoch 2: 100%|█████████▉| 823/824 [01:38<00:00,  8.51it/s, training_loss=0.002]
Epoch 2: 100%|█████████▉| 823/824 [01:38<00:00,  8.51it/s, training_loss=0.003]
Epoch 2: 100%|██████████| 824/824 [01:38<00:00,  8.74it/s, training_loss=0.003]
 50%|█████     | 1/2 [03:24<01:44, 104.72s/it]
Epoch 2
Training loss: 0.2466177608593462
100%|██████████| 2/2 [03:28<00:00, 104.14s/it]
Validation loss: 0.43929754253413067
F1 Score (Weighted): 0.8823813944083021
 
Epoch 1
Training loss: 0.4581567718019004
Validation loss: 0.3778555450533606
F1 Score (Weighted): 0.8753212258177056
 
Epoch 2
Training loss: 0.2466177608593462
Validation loss: 0.43929754253413067
F1 Score (Weighted): 0.8823813944083021

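In the output above, the training loss falls from 0.458 in the first epoch to 0.247 in the second, while the validation loss rises from 0.378 to 0.439. This suggests that the model starts to overfit in the second epoch, so judged by validation loss it is reasonable to stop training after the first epoch. Note that the weighted F1 score still improves slightly, from 0.875 to 0.882, so the choice depends on which metric is given priority.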

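Overfitting means that a model performs well on the training data but worse on unseen validation data. Since signs of overfitting appear right after the first epoch, continuing to train may degrade performance on the validation set, and stopping at that point helps preserve the model's ability to generalize. Techniques such as early stopping can automate this by halting training at the right time and keeping the best-performing checkpoint.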
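The early stopping rule mentioned above can be sketched as a small function over the per-epoch validation losses. This is a minimal illustration with a hypothetical helper name and a patience of one epoch. It is not part of the article's training loop, which always runs both epochs.

```python
def best_epoch(val_losses, patience=1):
    """Simulate early stopping on a list of per-epoch validation losses.

    Returns (best_epoch, stopped_epoch), both 1-based.
    """
    best_loss = float("inf")
    best = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best, epoch  # stop: no improvement for `patience` epochs
    return best, len(val_losses)


# Validation losses copied from the training log above
logged_val_losses = [0.3778555450533606, 0.43929754253413067]
result = best_epoch(logged_val_losses)
print(result)
```

Applied to the logged values, the rule selects the checkpoint from the first epoch, which is the one saved as finetuned_finBERT_epoch_1.model.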

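(23) Based on the output above, the best model by validation loss is the one saved at the end of the first epoch (Epoch 1). To load this model for prediction, run the following: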


model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert",
                                                          num_labels=len(sentiment_dict),
                                                          output_attentions=False,
                                                          output_hidden_states=False)
 
model.to(device)
model.load_state_dict(torch.load('./finetuned_finBERT_epoch_1.model', map_location=torch.device('cpu')))
_, predictions, true_vals = evaluate(dataloader_validation)
 
accuracy_per_class(predictions, true_vals)

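The code above loads the checkpoint saved after the first epoch, runs sentiment prediction on the validation set, and then computes and prints the accuracy for each sentiment class. Running this outputs: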


Class: neutral
Accuracy: 385/432
 
Class: negative
Accuracy: 81/91
Class: positive
Accuracy: 170/204

These per-class results show how well the model classifies each sentiment category and make it easy to spot the classes on which it is weaker.
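The counts above are easier to compare as rates. The sketch below shows one way per-class counts of this kind can be computed from predicted and true labels, and converts the reported counts into fractions. per_class_accuracy is a hypothetical stand-in, since the article's accuracy_per_class is defined in an earlier part and not shown here.

```python
from collections import Counter


def per_class_accuracy(pred_labels, true_labels):
    """Return {label: (correct, total)} grouped by the true label."""
    correct = Counter()
    total = Counter()
    for pred, true in zip(pred_labels, true_labels):
        total[true] += 1
        if pred == true:
            correct[true] += 1
    return {label: (correct[label], total[label]) for label in sorted(total)}


# Counts copied from the output above: (correct, total)
reported = {"neutral": (385, 432), "negative": (81, 91), "positive": (170, 204)}
rates = {name: hit / n for name, (hit, n) in reported.items()}
overall = (sum(hit for hit, _ in reported.values())
           / sum(n for _, n in reported.values()))

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
print(f"overall: {overall:.3f}")
```

Expressed this way, the positive class has the lowest rate of the three, so it is the first place to look when analyzing errors.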

End

