- X_train, X_val, y_train, y_val = train_test_split(financial_data.index.values,
- financial_data.label.values,
- test_size = 0.15,
- random_state = 2022,
- stratify = financial_data.label.values)
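(14) In the financial_data DataFrame, add a new column named data_type for the samples in the training set (X_train) and the validation set (X_val), and set its value to 'train' or 'val' to mark which split each sample belongs to. Then count how many times each sentiment label occurs in each split (training or validation). The code is as follows.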
- financial_data.loc[X_train, 'data_type'] = 'train'
- financial_data.loc[X_val, 'data_type'] = 'val'
- financial_data.groupby(['sentiment', 'label', 'data_type']).count()
- sentiment  label  data_type  NewsHeadline
- negative   1      train      513
- negative   1      val        91
- neutral    0      train      2447
- neutral    0      val        432
- positive   2      train      1159
- positive   2      val        204
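As a quick sanity check on the counts above, the validation share of each class can be computed directly from the table. This is a minimal sketch that only reuses the printed numbers; the dictionary `counts` and the helper `val_share` are names introduced here for illustration.

```python
# Counts copied from the groupby output above: (train, val) per sentiment.
counts = {
    "negative": (513, 91),
    "neutral": (2447, 432),
    "positive": (1159, 204),
}

def val_share(train, val):
    """Fraction of a class's samples that landed in the validation set."""
    return val / (train + val)

shares = {name: val_share(*tv) for name, tv in counts.items()}
for name, share in shares.items():
    print(f"{name:<8} validation share = {share:.4f}")

total_val = sum(v for _, v in counts.values())
total = sum(t + v for t, v in counts.values())
print(f"overall  validation share = {total_val / total:.4f}")
```

Every class ends up with close to 15% of its samples in the validation set, which is the effect that `stratify` together with `test_size = 0.15` is meant to achieve.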
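(15) Next, examine the distribution of news headline lengths to decide the maximum length to use when encoding the data. First, use the BertTokenizer class from the Hugging Face Transformers library to load a finance-domain tokenizer from the pretrained model "ProsusAI/finbert". The code is as follows.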
finbert_tokenizer = BertTokenizer.from_pretrained("ProsusAI/finbert", do_lower_case=True)
- Downloading: 100%|██████████|226k/226k [00:00<00:00, 807kB/s]
- Downloading: 100%|██████████|112/112 [00:00<00:00, 4.33kB/s]
- Downloading: 100%|██████████|252/252 [00:00<00:00, 10.1kB/s]
- Downloading: 100%|██████████|758/758 [00:00<00:00, 11.3kB/s]
headlines_sequence_lengths = get_headlines_len(financial_data)
- Encoding in progress...
- 100%|██████████| 4846/4846 [00:04<00:00, 1136.56it/s]
- End of Task.
Figure 9-2 Distribution of news headline lengths
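The helper `get_headlines_len` is defined outside this excerpt. As a rough sketch of the idea, assuming it simply records the token count of every headline, the lengths could be summarised as shown here. The function `token_lengths`, the whitespace-based `whitespace_encode` stand-in and the sample headlines are all hypothetical; in the real pipeline `finbert_tokenizer.encode` would take the place of the stand-in.

```python
import numpy as np

def token_lengths(texts, encode):
    """Return the number of tokens `encode` produces for each text."""
    return np.array([len(encode(t)) for t in texts])

# Stand-in tokenizer (assumption): splits on whitespace and adds two
# positions for the [CLS] and [SEP] special tokens used by BERT-style models.
def whitespace_encode(text):
    return ["[CLS]"] + text.split() + ["[SEP]"]

# Made-up headlines, used only to exercise the helper.
headlines = [
    "Profit rose in the third quarter",
    "Sales fell",
    "The company will open a new plant in Finland next year",
]

lengths = token_lengths(headlines, whitespace_encode)
print("max length:", lengths.max())
print("95th percentile:", np.percentile(lengths, 95))
```

Setting `max_length` at or slightly above the longest observed length means no headline is truncated, at the cost of more padding per sample.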
- encoded_data_train = finbert_tokenizer.batch_encode_plus(
- financial_data[financial_data.data_type=='train'].NewsHeadline.values,
- return_tensors='pt',
- add_special_tokens=True,
- return_attention_mask=True,
- pad_to_max_length=True,
- max_length=150 # the maximum length observed in the headlines
- )
- encoded_data_val = finbert_tokenizer.batch_encode_plus(
- financial_data[financial_data.data_type=='val'].NewsHeadline.values,
- return_tensors='pt',
- add_special_tokens=True,
- return_attention_mask=True,
- pad_to_max_length=True,
- max_length=150 # the maximum length observed in the headlines
- )
- input_ids_train = encoded_data_train['input_ids']
- attention_masks_train = encoded_data_train['attention_mask']
- labels_train = torch.tensor(financial_data[financial_data.data_type=='train'].label.values)
- input_ids_val = encoded_data_val['input_ids']
- attention_masks_val = encoded_data_val['attention_mask']
- sentiments_val = torch.tensor(financial_data[financial_data.data_type=='val'].label.values)
- dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
- dataset_val = TensorDataset(input_ids_val, attention_masks_val, sentiments_val)
- len(sentiment_dict)
- Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
- 3
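The warning above appears because `max_length` is given without an explicit truncation setting. In recent versions of Transformers the same behaviour is usually requested with `padding='max_length'` and `truncation=True` rather than `pad_to_max_length=True`. To make clear what these options produce, here is a library-free sketch of truncation, padding and the attention mask. The function `pad_and_truncate` and the token ids are illustrative assumptions, not part of the tokenizer's API, and the sketch ignores special tokens, which a real tokenizer preserves when truncating.

```python
import numpy as np

def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each id sequence to max_length, then right-pad it with pad_id.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens
    and 0 for padding positions.
    """
    input_ids = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(sequences), max_length), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:max_length]              # truncation
        input_ids[row, :len(kept)] = kept    # remaining positions stay pad_id
        attention_mask[row, :len(kept)] = 1
    return input_ids, attention_mask

# Made-up token ids for two "headlines" of different length.
ids, mask = pad_and_truncate([[5, 6, 7], [8, 9, 10, 11, 12, 13]], max_length=5)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore padded positions, which is why it is stored next to `input_ids` in the `TensorDataset` above.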
(19) Use the AutoModelForSequenceClassification class from the Hugging Face Transformers library to load a model for the sequence classification task from the pretrained model "ProsusAI/finbert". The code is as follows.
- model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert",
- num_labels=len(sentiment_dict),
- output_attentions=False,
- output_hidden_states=False)
- batch_size = 5
- dataloader_train = DataLoader(dataset_train,
- sampler=RandomSampler(dataset_train),
- batch_size=batch_size)
- dataloader_validation = DataLoader(dataset_val,
- sampler=SequentialSampler(dataset_val),
- batch_size=batch_size)
- optimizer = AdamW(model.parameters(),
- lr=1e-5,
- eps=1e-8)
- epochs = 2
- scheduler = get_linear_schedule_with_warmup(optimizer,
- num_warmup_steps=0,
- num_training_steps=len(dataloader_train)*epochs)
- seed_val = 2022
- random.seed(seed_val)
- np.random.seed(seed_val)
- torch.manual_seed(seed_val)
- torch.cuda.manual_seed_all(seed_val)
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
- model.to(device)
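The scheduler created above lowers the learning rate linearly from its initial value to zero over the training steps. The sketch below reproduces that rule in plain Python to show the values involved. `linear_lr` is a hypothetical helper written for illustration, and the step count assumes 824 batches per epoch (4,119 training samples with a batch size of 5).

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up from 0 towards base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: ramp down linearly, reaching 0 at num_training_steps.
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = 824              # ceil(4119 / 5)
total_steps = steps_per_epoch * 2  # epochs = 2

for step in (0, steps_per_epoch, total_steps):
    print(step, linear_lr(step, base_lr=1e-5, num_training_steps=total_steps))
```

With no warmup steps, the second epoch therefore starts at half of the initial learning rate.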
- for epoch in tqdm(range(1, epochs+1)):
-     model.train()
-     loss_train_total = 0
-     progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
-     for batch in progress_bar:
-         model.zero_grad()
-         batch = tuple(b.to(device) for b in batch)
-         inputs = {'input_ids': batch[0],
-                   'attention_mask': batch[1],
-                   'labels': batch[2],
-                   }
-         outputs = model(**inputs)
-         loss = outputs[0]  # the first output is the loss because labels are passed in
-         loss_train_total += loss.item()
-         loss.backward()
-         # clip the gradient norm to 1.0 to stabilise training
-         torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
-         optimizer.step()
-         scheduler.step()
-         # note: len(batch) is 3 here (the number of tensors in the tuple), not the batch size
-         progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item()/len(batch))})
-     # save a checkpoint at the end of every epoch
-     torch.save(model.state_dict(), f'finetuned_finBERT_epoch_{epoch}.model')
-     tqdm.write(f'\nEpoch {epoch}')
-     loss_train_avg = loss_train_total/len(dataloader_train)
-     tqdm.write(f'Training loss: {loss_train_avg}')
-     val_loss, predictions, true_vals = evaluate(dataloader_validation)
-     val_f1 = f1_score_func(predictions, true_vals)
-     tqdm.write(f'Validation loss: {val_loss}')
-     tqdm.write(f'F1 Score (Weighted): {val_f1}')
Epoch 1: 0%| | 0/824 [00:00<?, ?it/s]
- Epoch 1: 0%| | 0/824 [00:00<?, ?it/s]
- Epoch 1: 0%| | 0/824 [00:01<?, ?it/s, training_loss=1.028]
- Epoch 1: 0%| | 1/824 [00:01<13:58, 1.02s/it, training_loss=1.028]
- Epoch 1: 0%| | 1/824 [00:01<13:58, 1.02s/it, training_loss=0.840]
- Epoch 1: 0%| | 2/824 [00:01<06:44, 2.03it/s, training_loss=0.840]
- Epoch 1: 0%| | 2/824 [00:01<06:44, 2.03it/s, training_loss=0.459]
- Epoch 1: 0%| | 3/824 [00:01<04:24, 3.10it/s, training_loss=0.459]
- Epoch 1: 0%| | 3/824 [00:01<04:24, 3.10it/s, training_loss=0.382]
- Epoch 1: 0%| | 4/824 [00:01<03:18, 4.13it/s, training_loss=0.382]
- Epoch 1: 0%| | 4/824 [00:01<03:18, 4.13it/s, training_loss=0.300]
- ###### part of the output omitted
Epoch 1: 100%|██████████| 824/824 [01:39<00:00, 8.74it/s, training_loss=0.006]
- 0%| | 0/2 [01:40<?, ?it/s]
- Epoch 1
- Training loss: 0.4581567718019004
- ###### part of the output omitted
- 50%|█████ | 1/2 [01:44<01:44, 104.72s/it]
- Validation loss: 0.3778555450533606
- F1 Score (Weighted): 0.8753212258177056
Epoch 2: 0%| | 0/824 [00:00<?, ?it/s]
- Epoch 2: 0%| | 0/824 [00:00<?, ?it/s, training_loss=0.004]
- Epoch 2: 0%| | 1/824 [00:00<01:32, 8.87it/s, training_loss=0.004]
- Epoch 2: 0%| | 1/824 [00:00<01:32, 8.87it/s, training_loss=0.005]
- ###### part of the output omitted
- Epoch 2: 100%|█████████▉| 821/824 [01:38<00:00, 8.36it/s, training_loss=0.299]
- Epoch 2: 100%|█████████▉| 821/824 [01:38<00:00, 8.36it/s, training_loss=0.003]
- Epoch 2: 100%|█████████▉| 822/824 [01:38<00:00, 8.41it/s, training_loss=0.003]
- Epoch 2: 100%|█████████▉| 822/824 [01:38<00:00, 8.41it/s, training_loss=0.002]
- Epoch 2: 100%|█████████▉| 823/824 [01:38<00:00, 8.51it/s, training_loss=0.002]
- Epoch 2: 100%|█████████▉| 823/824 [01:38<00:00, 8.51it/s, training_loss=0.003]
Epoch 2: 100%|██████████| 824/824 [01:38<00:00, 8.74it/s, training_loss=0.003]
- 50%|█████ | 1/2 [03:24<01:44, 104.72s/it]
- Epoch 2
- Training loss: 0.2466177608593462
- 100%|██████████| 2/2 [03:28<00:00, 104.14s/it]
- Validation loss: 0.43929754253413067
- F1 Score (Weighted): 0.8823813944083021
- Epoch 1
- Training loss: 0.4581567718019004
- Validation loss: 0.3778555450533606
- F1 Score (Weighted): 0.8753212258177056
- Epoch 2
- Training loss: 0.2466177608593462
- Validation loss: 0.43929754253413067
- F1 Score (Weighted): 0.8823813944083021
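Overfitting means that a model performs well on the training data but worse on validation data it has not seen. Here the training loss falls from 0.458 to 0.247 in the second epoch while the validation loss rises from 0.378 to 0.439, even though the weighted F1 score edges up from 0.875 to 0.882. Signs of overfitting therefore appear right after the first epoch, and further training may degrade performance on the validation set. Stopping training at this point is a sensible way to preserve the model's generalization on the validation data. Techniques such as early stopping (Early Stopping) can also be used to halt training at the right moment and keep the best-performing model.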
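Early stopping can be sketched as a small helper that tracks the best validation loss and signals when it has stopped improving. The class `EarlyStopping`, its `patience` and `min_delta` parameters and the way it is called are assumptions made for illustration; the two validation losses are the values reported above.

```python
class EarlyStopping:
    """Signal a stop once validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember this epoch and reset the counter.
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=1)
val_losses = {1: 0.3778555450533606, 2: 0.43929754253413067}
for epoch, val_loss in val_losses.items():
    if stopper.step(epoch, val_loss):
        print(f"stop after epoch {epoch}; best epoch was {stopper.best_epoch}")
        break
```

In a real training loop, `stopper.step` would be called right after the validation pass of each epoch, and the checkpoint saved for `best_epoch` would be the one to reload.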
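(23) Judging by the validation loss in the preceding output, the best model is the one saved at the end of the first training epoch (Epoch 1). To load this model for prediction, proceed as follows: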
- model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert",
- num_labels=len(sentiment_dict),
- output_attentions=False,
- output_hidden_states=False)
- model.to(device)
- model.load_state_dict(torch.load('./finetuned_finBERT_epoch_1.model', map_location=torch.device('cpu')))
- _, predictions, true_vals = evaluate(dataloader_validation)
- accuracy_per_class(predictions, true_vals)
- Class: neutral
- Accuracy: 385/432
- Class: negative
- Accuracy: 81/91
- Class: positive
- Accuracy: 170/204
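The function `accuracy_per_class` is defined outside this excerpt. Assuming it compares the arg-max of the model outputs with the true labels class by class, a minimal NumPy sketch could look like this. The function name `per_class_accuracy` and the toy logits and labels are hypothetical.

```python
import numpy as np

def per_class_accuracy(logits, labels, class_names):
    """Return {class_name: (correct, total)} using arg-max predictions."""
    preds = np.argmax(logits, axis=1)
    result = {}
    for class_id, name in class_names.items():
        in_class = labels == class_id
        correct = int((preds[in_class] == class_id).sum())
        result[name] = (correct, int(in_class.sum()))
    return result

# Label ids follow the mapping shown earlier: neutral=0, negative=1, positive=2.
class_names = {0: "neutral", 1: "negative", 2: "positive"}
logits = np.array([[2.0, 0.1, 0.3],   # predicted neutral
                   [0.2, 1.5, 0.1],   # predicted negative
                   [0.9, 0.2, 0.4],   # predicted neutral
                   [0.1, 0.3, 2.2]])  # predicted positive
labels = np.array([0, 1, 2, 2])

acc = per_class_accuracy(logits, labels, class_names)
for name, (correct, total) in acc.items():
    print(f"Class: {name}  Accuracy: {correct}/{total}")
```

Applied to the figures above, the per-class accuracies are roughly 89.1% (neutral), 89.0% (negative) and 83.3% (positive), or about 87.5% overall (636 of 727 validation samples).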
