Is the Strongest Data Augmentation Method Really Just Adding Punctuation?

Author: weixin_40725706 | 2024-06-10 00:47:23


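Today's post covers a paper from EMNLP 2021 Findings titled "AEDA: An Easier Data Augmentation Technique for Text Classification". The whole paper can be summed up in one sentence: for text classification, inserting a few punctuation marks into a sentence is the strongest data augmentation method.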

AEDA Augmentation

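At this point you will surely want to ask: which punctuation marks, and how many? The paper answers both questions in detail, and this is also the only valuable part of the paper. The rest of the text mostly covers basic concepts and prior work.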

First, there are six candidate punctuation marks: {".", ";", "?", ":", "!", ","}. Second, let $n$ be the number of punctuation marks inserted into a sentence. Then

$n\in [1, \frac{1}{3}l]$

where $l$ is the sentence length. A few augmentation examples:

$\begin{array}{cc} \hline \textbf{Original} & \text{a sad , superior human comedy played out on the back roads of life .} \\ \hline \textbf{Aug 1} & \text{a sad , superior human comedy played out on the back roads ; of life ; .}\\ \hline \textbf{Aug 2} & \text{a , sad . , superior human ; comedy . played . out on the back roads of life .}\\ \hline \textbf{Aug 3} & \text{: a sad ; , superior ! human : comedy , played out ? on the back roads of life .}\\ \hline \end{array}$
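The insertion rule is simple enough to sketch in a few lines of Python. This is a minimal illustration, not the authors' code. The function name `aeda_augment`, the whitespace tokenization, the rounding of $\frac{1}{3}l$ down to an integer, and the choice to insert each mark in front of a randomly chosen token are all assumptions made for this example.

```python
import random

# The six candidate punctuation marks listed above.
PUNCTUATIONS = [".", ";", "?", ":", "!", ","]


def aeda_augment(sentence, rng=None):
    """Return a copy of `sentence` with n random punctuation marks inserted.

    Hypothetical helper: tokens are split on whitespace, and n is drawn
    uniformly from [1, l // 3] (at least 1), where l is the token count.
    """
    rng = rng or random.Random()
    words = sentence.split()
    if not words:
        return sentence

    # Upper bound on insertions: one third of the sentence length, at least 1.
    max_n = max(1, len(words) // 3)
    n = rng.randint(1, max_n)

    # Pick n distinct token positions; a mark goes in front of each one.
    positions = set(rng.sample(range(len(words)), n))

    augmented = []
    for i, word in enumerate(words):
        if i in positions:
            augmented.append(rng.choice(PUNCTUATIONS))
        augmented.append(word)
    return " ".join(augmented)


if __name__ == "__main__":
    original = "a sad , superior human comedy played out on the back roads of life ."
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    for k in range(3):
        print(f"Aug {k + 1}: {aeda_augment(original, rng)}")
```

The original words and their order are never changed, only padded with extra marks, which is what distinguishes this from EDA-style operations such as deletion or swapping.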

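Talk is cheap, so how well does it actually work? The paper runs a large number of text classification experiments and compares AEDA against EDA. Amusingly, the AEDA repo on GitHub is a fork of the EDA paper's repo, which feels a bit like beating EDA with its own code.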

First, consider the set of figures in which the authors compare the methods on 5 datasets (the model is an RNN).

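The results on BERT are shown in the table below. Why does the paper report BERT results on only 2 datasets when 5 datasets were tested above? My bold guess is that the results on the other datasets were not as good. Otherwise there would be no reason to leave them out.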

$\begin{array}{c|cc} \text{Model} & \text{SST2} & \text{TREC} \\ \hline \text{BERT} & 91.10 & 97.00\\ \hline \text{+EDA} & 90.99 & 96.00\\ \hline \text{+AEDA} & \pmb{91.76} & \pmb{97.20}\\ \end{array}$

Reference

AEDA: An Easier Data Augmentation Technique for Text Classification
