This project comes from a Kaggle competition: Natural Language Processing with Disaster Tweets | Kaggle
This post is mainly a summary and record of my own NLP learning process, to make later review easier; of course, if it helps any reader, I will be glad.
The project builds a machine learning model that predicts which tweets are about real disasters and which are not.
First, read the data and take a look at it; the data can be downloaded directly from Kaggle. The useful columns are text and target, and those two are all we need for training: target = 1 means the text describes a real disaster, otherwise it does not.
```python
import pandas as pd

train_df = pd.read_csv('/kaggle/input/nlpdataset/train.csv')
test_df = pd.read_csv('/kaggle/input/nlpdataset/test.csv')
train_df
```
Next, define a dictionary that maps contractions to their expanded forms, so that contractions in the text can be replaced with their full spellings. In NLP preprocessing, expanding English contractions makes a contraction and its full form equivalent representations, which is easier for the algorithm to handle. For example, "I'm" expands to "I am", "don't" to "do not", and "it's" to "it is" or "it has".
The main purpose of expansion is to remove ambiguity: the same abbreviation can have several readings; for instance, "AI" can mean "Artificial Intelligence" or "Air India". Expanding abbreviations to their full forms removes this ambiguity and helps the model interpret the data correctly.
If you ever need contraction expansion again, you can copy this dictionary directly and run the replacement.
```python
# Map contractions to their expanded forms
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "cannot",
    "can't've": "cannot have",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "couldn't've": "could not have",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hadn't've": "had not have",
    "hasn't": "has not",
    "haven't": "have not",
    "he'd": "he would",
    "he'd've": "he would have",
    "he'll": "he will",
    "he'll've": "he will have",
    "he's": "he is",
    "how'd": "how did",
    "how'd'y": "how do you",
    "how'll": "how will",
    "how's": "how is",
    "i'd": "I would",
    "i'd've": "I would have",
    "i'll": "I will",
    "i'll've": "I will have",
    "i'm": "I am",
    "i've": "I have",
    "isn't": "is not",
    "it'd": "it would",
    "it'd've": "it would have",
    "it'll": "it will",
    "it'll've": "it will have",
    "it's": "it is",
    "let's": "let us",
    "ma'am": "madam",
    "mayn't": "may not",
    "might've": "might have",
    "mightn't": "might not",
    "mightn't've": "might not have",
    "must've": "must have",
    "mustn't": "must not",
    "mustn't've": "must not have",
    "needn't": "need not",
    "needn't've": "need not have",
    "o'clock": "of the clock",
    "oughtn't": "ought not",
    "oughtn't've": "ought not have",
    "shan't": "shall not",
    "sha'n't": "shall not",
    "shan't've": "shall not have",
    "she'd": "she would",
    "she'd've": "she would have",
    "she'll": "she will",
    "she'll've": "she will have",
    "she's": "she is",
    "should've": "should have",
    "shouldn't": "should not",
    "shouldn't've": "should not have",
    "so've": "so have",
    "so's": "so is",
    "that'd": "that would",
    "that'd've": "that would have",
    "that's": "that is",
    "there'd": "there would",
    "there's": "there is",
    "they'd": "they would",
    "they'd've": "they would have",
    "they'll": "they will",
    "they'll've": "they will have",
    "they're": "they are",
    "they've": "they have",
    "to've": "to have",
    "wasn't": "was not",
    "we'd": "we would",
    "we'd've": "we would have",
    "we'll": "we will",
    "we'll've": "we will have",
    "we're": "we are",
    "we've": "we have",
    "weren't": "were not",
    "what'll": "what will",
    "what'll've": "what will have",
    "what're": "what are",
    "what's": "what is",
    "what've": "what have",
    "when's": "when is",
    "when've": "when have",
    "where'd": "where did",
    "where's": "where is",
    "where've": "where have",
    "who'll": "who will",
    "who'll've": "who will have",
    "who's": "who is",
    "who've": "who have",
    "why's": "why is",
    "why've": "why have",
    "will've": "will have",
    "won't": "will not",
    "won't've": "will not have",
    "would've": "would have",
    "wouldn't": "would not",
    "wouldn't've": "would not have",
    "y'all": "you all",
    "y'all'd": "you all would",
    "y'all'd've": "you all would have",
    "y'all're": "you all are",
    "y'all've": "you all have",
    "you'd": "you would",
    "you'd've": "you would have",
    "you'll": "you will",
    "you'll've": "you will have",
    "you're": "you are",
    "you've": "you have"
}

country_contractions = {
    'u.s': 'united states',
    'u.s.': 'united states',
    'u.s.a': 'united states',
    'u.k': 'united kingdom',
    'u.k.': 'united kingdom',
    'u.a.e': 'united arab emirates',
    'u.a.e.': 'united arab emirates',
    's.korea': 'south korea',
    'n.korea': 'north korea',
    'czech rep.': 'czech republic',
    'dominican rep.': 'dominican republic',
    'costa rica': 'republic of costa rica',
    'el salvador': 'republic of el salvador',
    'guinea-bissau': 'republic of guinea-bissau',
    'cote d\'ivoire': 'republic of cote d\'ivoire',
    'trinidad & tobago': 'republic of trinidad and tobago',
    'congo-brazzaville': 'republic of the congo',
    'congo-kinshasa': 'democratic republic of the congo',
    'sri lanka': 'democratic socialist republic of sri lanka',
    'central african rep.': 'central african republic',
    'san marino': 'republic of san marino',
    'são tomé & príncipe': 'democratic republic of são tomé and príncipe',
    'timor-leste': 'democratic republic of timor-leste'
}
```

Define the contraction-expansion function; the points worth noting are explained in the comments. One caveat: because the function splits the text on whitespace and looks up one word at a time, multi-word keys in country_contractions (e.g. 'sri lanka') will never match; a regex-based replacement would be needed to cover those.
```python
# Expand contractions
def expand_contractions(text, contractions):
    words = text.split()
    expanded_words = []  # renamed from the original, which shadowed the function name
    for word in words:
        if word.lower() in contractions:  # check whether the word is a key in the contractions dict
            expanded_words.append(contractions[word.lower()])
        else:
            expanded_words.append(word)
    return ' '.join(expanded_words)  # join the list back into one space-separated string
```
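As a quick sanity check, here is what the function does to a made-up sentence:

```python
print(expand_contractions("I can't believe it's real", contractions))
# I cannot believe it is real
```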
Now clean the text: expand contractions and replace URLs, digits, and punctuation with spaces. URLs, punctuation, and digits are mostly noise in natural language text; replacing them with spaces removes the noise so the model can focus on learning useful features.
```python
import re

# Apply expand_contractions, then use re.sub to replace URLs, punctuation, and digits with spaces
def clean_text(text):
    # Expand general contractions and country abbreviations
    text = expand_contractions(text, contractions)
    text = expand_contractions(text, country_contractions)

    # \S matches any non-whitespace character, + means one or more,
    # and | is alternation, so this removes tokens starting with http, https, or www
    # (http\S+ already covers https; the flags=re.MULTILINE argument is redundant here)
    text = re.sub(r'http\S+|https\S+|www\S+', '', text, flags=re.MULTILINE)

    text = re.sub(r'\W', ' ', text)  # non-word characters: punctuation, symbols, etc.
    text = re.sub(r'\d', ' ', text)  # digits
    text = re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace; strip removes leading/trailing spaces

    return text

train_df['clean_text'] = train_df['text'].apply(clean_text)
test_df['clean_text'] = test_df['text'].apply(clean_text)
train_df
```
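And a quick check of the full cleaning pipeline on a made-up tweet (the URL is a placeholder):

```python
print(clean_text("Forest fire near u.s. border! Can't evacuate 300 people http://t.co/abc123"))
# Forest fire near united states border cannot evacuate people
```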

Split off a validation set from the training data; note that the original dataset already ships with a separate test set.
```python
import math

# Use the first 90% of rows for training and the last 10% for validation
split_count = math.floor(len(train_df) / 10 * 9)
val_df = train_df[split_count:]
train_df = train_df[:split_count]
```
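One thing to be aware of: this slice-based split assumes the rows are already in random order and does not preserve the class ratio. A shuffled, stratified split is an alternative; this is a sketch using sklearn, not what the notebook above does:

```python
from sklearn.model_selection import train_test_split

# Hypothetical alternative: shuffle the rows and stratify on the label
# so the train/validation class ratios match
train_df, val_df = train_test_split(
    train_df, test_size=0.1, random_state=42, stratify=train_df['target']
)
```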
Next, use the Transformers library to define the tokenizer and model. The model here is DistilBERT, a lightweight version of BERT: by shrinking BERT's size and compute cost through knowledge distillation, DistilBERT keeps most of BERT's performance while having a smaller footprint and faster inference.
Transformers is an NLP library developed by Hugging Face. It provides a range of pretrained deep-learning models, including BERT, GPT, RoBERTa, and DistilBERT, and makes it easy to fine-tune them for different tasks. The library covers a wide range of NLP applications, such as sentiment analysis, machine translation, text classification, and question answering.
If you want to dig deeper into the Transformers library, the PyTorch site has a page about it: PyTorch-Transformers | PyTorch
```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

MODEL_NAME = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
```
Next, convert the datasets into the torch format the model expects. The tokenizer splits the input text into tokens and handles the necessary bookkeeping; the padding and truncation arguments control padding and truncation behavior. With padding=True, every sequence in a batch is padded with pad_token_id up to the length of the longest sequence in that batch, so the batch can be processed at once; with truncation=True, sequences longer than the model's maximum length (512 tokens for DistilBERT) are cut off.
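Before converting the full datasets, here is a quick look at what the tokenizer returns for a toy batch (the sentences are made up):

```python
sample = tokenizer(["a short tweet", "a slightly longer example tweet"],
                   padding=True, truncation=True)
print(sample['input_ids'])       # the shorter sequence is padded with 0, the [PAD] id
print(sample['attention_mask'])  # 1 marks real tokens, 0 marks padding
```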
```python
from transformers import TrainingArguments, Trainer

def tokenize(batch):
    return tokenizer(batch['text'], padding=True, truncation=True)

# The test set has no labels, so fill in a placeholder
test_df["target"] = 0

# Keep only the cleaned text and the label
train_dataset = train_df[['clean_text', 'target']]
val_dataset = val_df[['clean_text', 'target']]
test_dataset = test_df[['clean_text', 'target']]

# Rename the columns
train_dataset = train_dataset.rename(columns={'clean_text': 'text', 'target': 'label'})
val_dataset = val_dataset.rename(columns={'clean_text': 'text', 'target': 'label'})
test_dataset = test_dataset.rename(columns={'clean_text': 'text', 'target': 'label'})

# 'records' turns each row into a dict and collects them in a list
train_dataset = train_dataset.to_dict('records')
val_dataset = val_dataset.to_dict('records')
test_dataset = test_dataset.to_dict('records')

from datasets import Dataset

# In hindsight, the to_dict('records') round trip is unnecessary:
# Dataset.from_pandas could take the DataFrames directly
train_dataset = Dataset.from_pandas(pd.DataFrame(train_dataset))
val_dataset = Dataset.from_pandas(pd.DataFrame(val_dataset))
test_dataset = Dataset.from_pandas(pd.DataFrame(test_dataset))

# Setting batch_size to the dataset size tokenizes the whole dataset in one map call
train_dataset = train_dataset.map(tokenize, batched=True, batch_size=len(train_dataset))
val_dataset = val_dataset.map(tokenize, batched=True, batch_size=len(val_dataset))
test_dataset = test_dataset.map(tokenize, batched=True, batch_size=len(test_dataset))

# Convert the datasets to torch format
train_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
val_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
```

Set the training arguments and the trainer, then train and save the model.
```python
import os

# Disable Weights & Biases logging
os.environ["WANDB_DISABLED"] = "true"

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    save_strategy="epoch"
)

# Initialize the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Save the model weights
model.save_pretrained("./model")
```
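As configured, the Trainer only reports the loss during evaluation. If you also want accuracy and F1 on the validation set, you could pass a compute_metrics function; this is a sketch assuming scikit-learn is available, wired in via Trainer(..., compute_metrics=compute_metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, preds),
        'f1': f1_score(labels, preds),
    }
```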

Load the saved model and run predictions on the test set.
```python
# Load the model
loaded_model = DistilBertForSequenceClassification.from_pretrained("./model")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loaded_model.to(device)  # the model must live on the same device as the inputs
loaded_model.eval()

def predict(text, loaded_model, tokenizer):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = inputs.to(device)
    with torch.no_grad():  # no gradients needed at inference time
        outputs = loaded_model(**inputs)
    logits = outputs.logits
    probabilities = torch.softmax(logits, dim=-1)
    predictions = torch.argmax(probabilities, dim=-1)
    return predictions.item()

# Note: the original passed the in-memory `model` here instead of `loaded_model`
test_df["predicted_target"] = test_df["clean_text"].apply(lambda x: predict(x, loaded_model, tokenizer))
```
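Predicting one tweet at a time with apply is slow; trainer.predict runs the whole tokenized test set in batches instead. Either way, the final step for the competition is writing a submission file. This is a sketch; the id/target column names follow the Kaggle sample_submission:

```python
# Batched alternative to the row-by-row loop above
preds = trainer.predict(test_dataset)
test_df["predicted_target"] = preds.predictions.argmax(axis=-1)

# Build the submission file expected by Kaggle
submission = test_df[["id"]].copy()
submission["target"] = test_df["predicted_target"]
submission.to_csv("submission.csv", index=False)
```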