While working on a project, we needed to evaluate and filter generated sentences by diversity, so I surveyed some of the existing unsupervised metrics for sentence diversity, for later reference.
paper: BERTSCORE: EVALUATING TEXT GENERATION WITH BERT
Greedy matching: for each token, find the token in the other sentence with the largest inner product.
$$R_{BERT} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j, \quad P_{BERT} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^\top \hat{x}_j, \quad F_{BERT} = 2\,\frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}$$
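A minimal sketch of this greedy matching, assuming the contextual token embeddings have already been extracted from BERT and unit-normalized (the function and variable names are mine):

```python
import numpy as np

def bertscore(x, x_hat):
    """x: (|x|, d) reference token embeddings; x_hat: (|x_hat|, d) candidate ones.
    Rows are assumed unit-normalized, so the inner product is cosine similarity."""
    sim = x @ x_hat.T                  # sim[i, j] = x_i^T x_hat_j
    r = sim.max(axis=1).mean()         # R_BERT: best candidate match per reference token
    p = sim.max(axis=0).mean()         # P_BERT: best reference match per candidate token
    return p, r, 2 * p * r / (p + r)   # F_BERT
```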
Importance weighting based on inverse document frequency:
$$idf(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right]$$
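A sketch of computing these idf weights over the $M$ tokenized reference sentences, following the formula directly (the paper additionally smooths for unseen tokens; the function name is mine):

```python
import math
from collections import Counter

def idf_table(references):
    """idf(w) = -log((1/M) * sum_i I[w in x^(i)]) over M tokenized references."""
    M = len(references)
    df = Counter(w for sent in references for w in set(sent))  # document frequency
    return {w: -math.log(n / M) for w, n in df.items()}
```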
$$\hat{R}_{BERT} = \frac{R_{BERT} - b}{1 - b}$$
b: empirical lower bound, calculated using Common Crawl monolingual datasets
machine translation evaluation -> $F_{BERT}$
text generation in English -> 24-layer $RoBERTa_{large}$
non-English language -> $BERT_{multi}$
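For practical use, a usage sketch of the bert-score package (pip install bert-score), which, as I understand its public API, exposes both the language-based model selection and the baseline rescaling above:

```python
from bert_score import score

cands = ["there is a cat on the mat"]
refs = ["a cat sits on the mat"]
# lang="en" selects the default English model; rescale_with_baseline applies
# the (R - b) / (1 - b) rescaling described above
P, R, F1 = score(cands, refs, lang="en", rescale_with_baseline=True)
print(F1.mean().item())
```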
paper: BLEURT: Learning Robust Metrics for Text Generation. ACL 2020
BERT + linear head
random perturbations of Wikipedia sentences augmented with a diverse set of lexical and semantic-level supervision signals
pre-training signals: existing metrics, combined as a weighted sum of per-metric losses
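A usage sketch of the official bleurt package from google-research/bleurt, as I understand its API; the checkpoint name is a placeholder for a downloaded checkpoint directory:

```python
# pip install git+https://github.com/google-research/bleurt.git
from bleurt import score

# "BLEURT-20" stands in for the path of a downloaded checkpoint directory
scorer = score.BleurtScorer("BLEURT-20")
scores = scorer.score(references=["a cat sits on the mat"],
                      candidates=["there is a cat on the mat"])
print(scores)  # one float per (reference, candidate) pair
```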
paper: BARTSCORE: Evaluating Generated Text as Text Generation
ExplainaBoard:http://explainaboard.nlpedia.ai/leaderboard/task-meval/
$$\mathrm{BARTScore} = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x, \theta)$$
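A minimal sketch of this weighted log-likelihood with uniform weights $\omega_t = 1/m$, using Hugging Face transformers and the facebook/bart-large-cnn checkpoint; the official implementation differs in batching and weighting details:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bart_score(x, y):
    """Mean log p(y_t | y_<t, x), i.e. the formula above with w_t = 1/m."""
    src = tok(x, return_tensors="pt")
    tgt = tok(y, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**src, labels=tgt).loss  # mean token cross-entropy on y
    return -loss.item()
```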
using prompts to augment the metric
I did not fully understand this part: the paper first lists a whole set of metric variants, but in the end there is only a single BARTScore. Judging from ExplainaBoard, my guess is that the evaluated task / input pair $\{x, y\}$ differs, so BARTScore reflects a different aspect of sentence quality in each setting.
MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. link
$$WMD(x^n, y^n) := \min_{F \in \mathbb{R}^{|x^n| \times |y^n|}} \langle C, F \rangle, \quad \text{s.t.}\ F\mathbf{1} = f_{x^n},\ F^\top \mathbf{1} = f_{y^n}$$
$C_{ij} = d(x_i^n, y_j^n)$: the distance between the i-th n-gram of $x$ and the j-th n-gram of $y$
$F$: transportation flow matrix, with $F_{ij}$ denoting the amount of flow traveling from the i-th n-gram $x_i^n$ in $x^n$ to the j-th n-gram $y_j^n$ in $y^n$
$\langle C, F \rangle = \mathrm{sum}(C \odot F)$
$d(x_i^n, y_j^n)$: Euclidean distance
$$f_{x_i^n} = \frac{1}{Z} \sum_{k=i}^{i+n-1} idf(x_k)$$
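A sketch of solving this transportation problem exactly with the POT library (pip install pot), assuming the n-gram embeddings and the idf-based weights $f_{x^n}, f_{y^n}$ are precomputed; the function name is mine:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def wmd(x_emb, y_emb, f_x, f_y):
    """min_F <C, F> s.t. F 1 = f_x, F^T 1 = f_y, solved with an exact OT solver."""
    C = ot.dist(x_emb, y_emb, metric="euclidean")       # C_ij = d(x_i^n, y_j^n)
    return ot.emd2(f_x / f_x.sum(), f_y / f_y.sum(), C)  # marginals must sum to 1
```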
For a given token, BERTScore takes only the similarity (inner product) with the single most similar token in the other sentence, whereas MoverScore takes a weighted sum of inner products with all tokens, with the weights (the $F$ in the formula above) solved under idf-based marginal constraints.
Embedding Average: directly take the average of the word vectors of the generated text and of the reference text as each text's vector representation, then use the cosine similarity between the two vectors as the similarity between generated and reference text:
$$\bar{e}_r = \frac{\sum_{\omega \in r} e_{\omega}}{\left| \sum_{\omega' \in r} e_{\omega'} \right|}, \quad EA := \cos(\bar{e}_r, \bar{e}_{\hat{r}})$$
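A minimal NumPy sketch of this score; note the formula normalizes the sum of word vectors by the norm of the sum, so both sentence vectors are unit length and the cosine reduces to a dot product (the function names are mine):

```python
import numpy as np

def sentence_vec(word_vecs):
    """bar{e}_r: the sum of word vectors divided by the norm of the sum."""
    s = word_vecs.sum(axis=0)
    return s / np.linalg.norm(s)

def ea(ref_vecs, hyp_vecs):
    """EA := cos(bar{e}_r, bar{e}_{r_hat}); inputs are (num_words, dim) arrays."""
    # both sentence vectors are unit length, so the cosine is the inner product
    return float(sentence_vec(ref_vecs) @ sentence_vec(hyp_vecs))
```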
Widely used and simple, but also affected by various surface features of the sentences, such as length and proper nouns.
Many existing metrics for evaluating generated text originate from the Machine Translation task, computing the similarity / degree of match between the original sentence and the translated sentence.