
Abstractive Summarization for Data Augmentation


A Creative Solution to Imbalanced Class Distribution

Imbalanced class distribution is a common problem in Machine Learning. I was recently confronted with this issue when training a sentiment classification model. Certain categories were far more prevalent than others and the predictive quality of the model suffered. The first technique I used to address this was random under-sampling, wherein I randomly sampled a subset of rows from each category up to a ceiling threshold. I selected a ceiling that reasonably balanced the upper 3 classes. Although a small improvement was observed, the model was still far from optimal.

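For illustration, a minimal sketch of this kind of ceiling-based random under-sampling might look like the following. This is not the project's code; the DataFrame, its label column, and the ceiling value are stand-ins for a simple single-label case.

import pandas as pd

def undersample(df: pd.DataFrame, ceiling: int, label_col: str = 'label') -> pd.DataFrame:
    # Keep at most `ceiling` randomly sampled rows per label;
    # minority labels already below the ceiling are left untouched.
    sampled = []
    for _, group in df.groupby(label_col):
        sampled.append(group.sample(min(len(group), ceiling), random_state=42))
    return pd.concat(sampled).reset_index(drop=True)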

I needed a way to deal with the under-represented classes. I could not rely on traditional techniques used in multi-class classification such as sample and class weighting, as I was working with a multi-label dataset. It became evident that I would need to leverage oversampling in this situation.


A technique such as SMOTE (Synthetic Minority Over-sampling Technique) can be effective for oversampling, although the problem again becomes a bit more difficult with multi-label datasets. MLSMOTE (Multi-Label Synthetic Minority Over-sampling Technique) has been proposed [1], but the high dimensional nature of the numerical vectors created from text can sometimes make other forms of data augmentation more appealing.

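For context, this is roughly what standard SMOTE looks like with the imbalanced-learn library on a multi-class (not multi-label) problem; the synthetic feature matrix below merely stands in for high-dimensional vectorized text and is not data from the project.

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy imbalanced multi-class data standing in for vectorized text
X, y = make_classification(
    n_samples=1000, n_features=50, n_informative=10,
    n_classes=3, weights=[0.8, 0.15, 0.05], random_state=42
)

# Oversample the minority classes with synthetic interpolated samples
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)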

Photo by Christian Wagner on Unsplash

Transformers to the Rescue!

If you decided to read this article, it is safe to assume that you are aware of the latest advances in Natural Language Processing bequeathed by the mighty Transformers. The exceptional developers at Hugging Face in particular have opened the door to this world through their open source contributions. One of their more recent releases implements a breakthrough in Transfer Learning called the Text-to-Text Transfer Transformer, or T5 model, originally presented by Raffel et al. in their paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [2].


T5 allows us to execute various NLP tasks by specifying prefixes to the input text. In my case, I was interested in Abstractive Summarization, so I made use of the summarize prefix.

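As a quick illustration of the prefix mechanism (a sketch using the Hugging Face transformers API rather than the project code), the same model handles different tasks depending only on the prefix prepended to the input:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

for prompt in [
    'summarize: The report found that sales grew steadily while costs remained flat...',
    'translate English to German: The house is wonderful.',
]:
    # The task is selected purely by the text prefix
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output_ids = model.generate(input_ids, max_length=40, num_beams=4, early_stopping=True)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))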

Text-to-Text Transfer Transformer [2]

Abstractive Summarization

Abstractive Summarization, put simply, is a technique by which a chunk of text is fed to an NLP model and a novel summary of that text is returned. This should not be confused with Extractive Summarization, where sentences are embedded and a clustering algorithm is executed to find those closest to the clusters' centroids; that is, existing sentences are returned. Abstractive Summarization seemed particularly appealing as a Data Augmentation technique because of its ability to generate novel yet realistic sentences of text.

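To make the contrast concrete, here is a minimal sketch of the extractive approach described above (not part of the project code): sentences are embedded, clustered, and the existing sentences nearest to the cluster centroids are returned. TF-IDF stands in for any sentence-embedding method.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def extractive_summary(sentences, n_clusters=3):
    # Embed each existing sentence
    embeddings = TfidfVectorizer().fit_transform(sentences).toarray()
    # Cluster the sentence embeddings
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(embeddings)
    # Return the existing sentences closest to each centroid
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embeddings)
    return [sentences[i] for i in sorted(set(closest))]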

Algorithm

Here are the steps I took to use Abstractive Summarization for Data Augmentation, including code segments illustrating the solution.


I first needed to determine how many rows each under-represented class required. The number of rows to add for each feature is thus calculated with a ceiling threshold, and we refer to these as the append_counts. Features with counts above the ceiling are not appended. In particular, if a given feature has 1000 rows and the ceiling is 100, its append count will be 0. The following methods trivially achieve this in the situation where features have been one-hot encoded:


def get_feature_counts(self, df):
    # Sum each one-hot encoded feature column to get its current row count
    shape_array = {}
    for feature in self.features:
        shape_array[feature] = df[feature].sum()
    return shape_array

def get_append_counts(self, df):
    # Rows to append per feature: threshold minus current count,
    # or 0 if the feature already meets the threshold
    append_counts = {}
    feature_counts = self.get_feature_counts(df)
    for feature in self.features:
        if feature_counts[feature] >= self.threshold:
            count = 0
        else:
            count = self.threshold - feature_counts[feature]
        append_counts[feature] = count
    return append_counts

For each feature, a loop runs from the current append index up to that index plus the append count specified for the feature. The append_index variable, along with a tasks array, is introduced to allow for multiprocessing, which we will discuss shortly.


counts = self.get_append_counts(self.df)

# Create append dataframe with length of all rows to be appended
self.df_append = pd.DataFrame(
    index=np.arange(sum(counts.values())),
    columns=self.df.columns
)

# Creating array of tasks for multiprocessing
tasks = []

# set all feature values to 0
for feature in self.features:
    self.df_append[feature] = 0

for feature in self.features:
    num_to_append = counts[feature]
    for num in range(
            self.append_index,
            self.append_index + num_to_append
    ):
        tasks.append(
            self.process_abstractive_summarization(feature, num)
        )
    # Updating index for insertion into shared appended dataframe
    # to preserve indexing for multiprocessing
    self.append_index += num_to_append

An Abstractive Summarization is calculated for a subset of specified size drawn from all rows that uniquely have the given feature, and is added to the append DataFrame with its respective feature one-hot encoded.


# Rows that have only the given feature set (uniquely labeled)
df_feature = self.df[
    (self.df[feature] == 1) &
    (self.df[self.features].sum(axis=1) == 1)
]
df_sample = df_feature.sample(self.num_samples, replace=True)
text_to_summarize = ' '.join(
    df_sample[:self.num_samples]['review_text']
)
new_text = self.get_abstractive_summarization(text_to_summarize)
self.df_append.at[num, 'text'] = new_text
self.df_append.at[num, feature] = 1

The Abstractive Summarization itself is generated in the following way:


# Prepend the T5 summarization task prefix
t5_prepared_text = "summarize: " + text_to_summarize

if self.device.type == 'cpu':
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors
    ).to(self.device)
else:
    tokenized_text = self.tokenizer.encode(
        t5_prepared_text,
        return_tensors=self.return_tensors
    )

# Generate the summary with beam search, then decode it back to text
summary_ids = self.model.generate(
    tokenized_text,
    num_beams=self.num_beams,
    no_repeat_ngram_size=self.no_repeat_ngram_size,
    min_length=self.min_length,
    max_length=self.max_length,
    early_stopping=self.early_stopping
)

output = self.tokenizer.decode(
    summary_ids[0],
    skip_special_tokens=self.skip_special_tokens
)

In initial tests the summarization calls to the T5 model were extremely time-consuming, reaching up to 25 seconds even on a GCP instance with an NVIDIA Tesla P100. Clearly this needed to be addressed to make abstractive summarization a feasible solution for data augmentation.


Photo by Brad Neathery on Unsplash

Multiprocessing

I introduced a multiprocessing option, whereby the calls to Abstractive Summarization are stored in a tasks array that is later passed to a sub-routine running the calls in parallel using the multiprocessing library. This resulted in a dramatic decrease in runtime. I must thank David Foster for his succinct Stack Overflow contribution [3]!


from multiprocessing import Process

running_tasks = [Process(target=task) for task in tasks]
for running_task in running_tasks:
    running_task.start()
for running_task in running_tasks:
    running_task.join()

Simplified Solution

To make things easier for everybody, I packaged this into a library called absum. It can be installed through pip: pip install absum. One can also download it directly from the repository.


Running the code on your own dataset is then simply a matter of importing the library’s Augmentor class and running its abs_sum_augment method as follows:


import pandas as pd
from absum import Augmentor

csv = 'path_to_csv'
df = pd.read_csv(csv)
augmentor = Augmentor(df)
df_augmented = augmentor.abs_sum_augment()
df_augmented.to_csv(
    csv.replace('.csv', '-augmented.csv'),
    encoding='utf-8',
    index=False
)

absum uses the Hugging Face T5 model by default, but is designed in a modular way to allow you to use any pre-trained or out-of-the-box Transformer models capable of Abstractive Summarization. It is format agnostic, expecting only a DataFrame containing text and one-hot encoded features. If additional columns are present that you do not wish to be considered, you have the option to pass in specific one-hot encoded features as a comma-separated string to the features parameter.

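A hedged usage sketch of that option, reusing the df from the snippet above; the column names passed to features are hypothetical:

from absum import Augmentor

# Only these one-hot encoded columns will be considered for augmentation
augmentor = Augmentor(df, features='sentiment_positive,sentiment_negative')
df_augmented = augmentor.abs_sum_augment()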

Also of special note are the min_length and max_length parameters, which determine the size of the resulting summarizations. One trick I found useful is to find the average character count of the text data you're working with, then set the minimum length a bit below that average and pad the maximum length a bit above it. All available parameters are detailed in the documentation.

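A sketch of that heuristic, under the assumption that the text lives in a 'review_text' column and that offsets of roughly 50 characters are reasonable for the data at hand:

# Average character count of the text being summarized
avg_chars = int(df['review_text'].str.len().mean())

# Start a bit below the average for min_length and pad max_length a bit above it
augmentor = Augmentor(
    df,
    min_length=max(avg_chars - 50, 10),
    max_length=avg_chars + 50
)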

Feel free to add any suggestions for improvement in the comments or, better yet, in a PR. Happy coding!


Translated from: https://towardsdatascience.com/abstractive-summarization-for-data-augmentation-1423d8ec079e
