Full-parameter fine-tuning (FFT) vs. parameter-efficient fine-tuning (PEFT)
Problems with full-parameter fine-tuning:
Parameter-efficient fine-tuning:
Parameter-Efficient Transfer Learning for NLP
Motivation:
Method:
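A minimal PyTorch sketch of the bottleneck adapter this paper introduces: a down-projection, a nonlinearity, an up-projection, and a residual connection, inserted inside each Transformer layer while the pretrained weights stay frozen. The hidden size, bottleneck size, GELU activation, and near-zero initialization are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: h <- h + W_up * f(W_down * h), inserted inside each Transformer layer."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):  # sizes are illustrative choices
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # down-projection
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # up-projection
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)                      # near-identity initialization,
        nn.init.zeros_(self.up.bias)                        # so training starts from the pretrained model

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))          # residual keeps the original signal
```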
Details:
BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning
Motivation:
Method:
$$\mathbf{h}^{l+1}=\mathrm{LN}\big(\mathbf{h}^l+\mathrm{SA}(\mathbf{h}^l)+\mathrm{TS}(\mathbf{h}^l)\big)$$
$$\mathrm{TS}(\mathbf{h})=V^{D}\,g(V^{E}\mathbf{h})$$
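A minimal sketch of the TS branch above: $V^E$ projects the hidden states into a small dimension, $g$ is applied there (in PALs, a small multi-head attention), and $V^D$ projects back up. The dimensions and the use of `nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PALBranch(nn.Module):
    """Task-specific branch TS(h) = V_D * g(V_E * h), added in parallel to self-attention."""
    def __init__(self, hidden_dim=768, small_dim=204, num_heads=12):  # illustrative sizes
        super().__init__()
        self.v_e = nn.Linear(hidden_dim, small_dim, bias=False)  # encoder V^E (shared across tasks in the paper)
        self.v_d = nn.Linear(small_dim, hidden_dim, bias=False)  # decoder V^D (shared across tasks in the paper)
        # g: a small multi-head attention operating in the low-dimensional space
        self.g = nn.MultiheadAttention(small_dim, num_heads, batch_first=True)

    def forward(self, h):                       # h: (batch, seq_len, hidden_dim)
        low = self.v_e(h)                       # project down
        attn_out, _ = self.g(low, low, low)     # task-specific attention in the small space
        return self.v_d(attn_out)               # project back up

# Per the equation above, the Transformer layer then computes
#   h_next = LayerNorm(h + SelfAttention(h) + PALBranch(h))
```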
Details:
Results:
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
Motivation:
Method:
Details:
Results:
AdapterDrop: On the Efficiency of Adapters in Transformers
Motivation:
Method:
Details:
Results:
Parameter-Efficient Transfer Learning with Diff Pruning
Motivation:
The goal is to extend the model with a task-specific difference vector, without modifying the original model architecture, so as to improve downstream-task performance.
Parameter-efficient transfer is achieved by learning a sparse update to the neural network's weights.
Method:
$$\boldsymbol{\delta}_\tau=\mathbf{z}_\tau\odot\mathbf{w}_\tau,\quad\mathbf{z}_\tau\in\{0,1\}^d,\ \mathbf{w}_\tau\in\mathbb{R}^d$$
Here $\boldsymbol{\delta}_\tau$ is the added diff vector, applied on top of the fixed pretrained parameters.
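A minimal sketch of this parameterization for a single frozen weight matrix; in the paper the binary mask $\mathbf{z}_\tau$ is learned through a relaxed $L_0$ penalty, whereas here it is shown as a fixed buffer purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffPrunedLinear(nn.Module):
    """y = x (W0 + delta)^T + b, where delta = z ⊙ w is a sparse task-specific diff."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.weight = pretrained.weight                       # pretrained W0, kept fixed
        self.weight.requires_grad_(False)
        self.bias = pretrained.bias                           # pretrained bias, also fixed
        if self.bias is not None:
            self.bias.requires_grad_(False)
        self.w = nn.Parameter(torch.zeros_like(self.weight))  # dense diff values w_tau (trainable)
        # In the paper z_tau is a binary mask learned via a relaxed L0 penalty;
        # a fixed buffer is used here purely for illustration.
        self.register_buffer("z", torch.ones_like(self.weight))

    def forward(self, x):
        delta = self.z * self.w                               # delta_tau = z_tau ⊙ w_tau
        return F.linear(x, self.weight + delta, self.bias)
```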
Compared with other PEFT methods, diff pruning involves optimizing over all of the model's parameters, so it is rarely used at today's model scales; the details are therefore not elaborated here.
Results:
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Motivation:
Method:
Details:
Results:
The Power of Scale for Parameter-Efficient Prompt Tuning
Motivation:
Method:
Details:
Results:
WARP: Word-level Adversarial ReProgramming
Motivation:
Method:
Details:
Results:
LORA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS
Motivation:
Method:
$$h = W_0x+\Delta W x = W_0x + BAx$$
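A minimal PyTorch sketch of the update above: the pretrained weight $W_0$ stays frozen and only the low-rank factors $A$ and $B$ are trained. The rank and scaling values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x: W0 stays frozen, only the low-rank factors A and B are trained."""
    def __init__(self, pretrained: nn.Linear, r=8, alpha=16):  # r and alpha are illustrative
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():                      # freeze W0 (and its bias)
            p.requires_grad_(False)
        d_out, d_in = pretrained.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # A: Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B: zero init, so ΔW = 0 at the start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)  # W0 x + BAx
```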
Details:
Results: (there are many; only a RoBERTa-based result is shown here)
GPT Understands, Too
Motivation:
Method:
Details:
Results:
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
Motivation:
Method:
Details:
Results:
BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models
Motivation:
Method:
Details:
Results:
TOWARDS A UNIFIED VIEW OF PARAMETER-EFFICIENT TRANSFER LEARNING
Motivation:
Adapter: $h \leftarrow h + f(h\boldsymbol{W}_\mathrm{down})\boldsymbol{W}_\mathrm{up}$
Prefix-tuning: $\operatorname{head}_{i}=\operatorname{Attn}\big(\boldsymbol{x}\boldsymbol{W}_{q}^{(i)},\ \operatorname{concat}(\boldsymbol{P}_{k}^{(i)},\boldsymbol{C}\boldsymbol{W}_{k}^{(i)}),\ \operatorname{concat}(\boldsymbol{P}_{v}^{(i)},\boldsymbol{C}\boldsymbol{W}_{v}^{(i)})\big)$
LoRA: $h \leftarrow h + s\cdot \boldsymbol{x}\boldsymbol{W}_\mathrm{down}\boldsymbol{W}_\mathrm{up}$
Method:
$$\boldsymbol{h}\leftarrow(1-\lambda(\boldsymbol{x}))\,\boldsymbol{h}+\lambda(\boldsymbol{x})\,f(\boldsymbol{x}\boldsymbol{W}_1)\boldsymbol{W}_2$$
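A minimal sketch of this unified view: each method computes a delta of the form $f(\boldsymbol{x}\boldsymbol{W}_\mathrm{down})\boldsymbol{W}_\mathrm{up}$ and combines it with the hidden state, either additively (adapter / LoRA) or through the gate $\lambda(\boldsymbol{x})$ in the prefix-tuning rewrite. In the paper $\lambda(\boldsymbol{x})$ is the attention weight placed on the prefix; a learned scalar gate is used here purely for illustration, and the sizes are assumptions.

```python
import torch
import torch.nn as nn

class UnifiedDelta(nn.Module):
    """Unified view: delta = f(x @ W_down) @ W_up, combined with the hidden state h
    either additively (adapter / LoRA) or through a gate lambda(x) (prefix tuning)."""
    def __init__(self, d_model=768, bottleneck=64, gated=False):  # illustrative sizes
        super().__init__()
        self.W_down = nn.Linear(d_model, bottleneck, bias=False)
        self.W_up = nn.Linear(bottleneck, d_model, bias=False)
        self.f = nn.ReLU()
        self.gated = gated
        if gated:
            # In the paper lambda(x) is the attention weight on the prefix;
            # a learned scalar gate stands in for it here, purely for illustration.
            self.gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())

    def forward(self, x, h):
        delta = self.W_up(self.f(self.W_down(x)))
        if self.gated:                                   # h <- (1 - lambda(x)) h + lambda(x) delta
            lam = self.gate(x)
            return (1 - lam) * h + lam * delta
        return h + delta                                 # h <- h + delta (adapter / LoRA form)
```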
Details: (Table 1 summarizes the methods, including their formulas, the form of the modification, and where it is inserted)
Results:
UNIPELT: A Unified Framework for Parameter-Efficient Language Model Tuning
Motivation:
Method: (the blue blocks in the figure are the parameter modules that are updated during fine-tuning)
Details:
Results:
ADALORA: ADAPTIVE BUDGET ALLOCATION FOR PARAMETER-EFFICIENT FINE-TUNING
Motivation:
Figure 1 shows that which weight matrices are adapted, and in which Transformer layers, both have a substantial impact on model performance.
The question this paper therefore sets out to answer:
"How can we allocate the parameter budget adaptively according to importance of modules to improve the performance of parameter-efficient fine-tuning?"
Method:
$$W=W^{(0)}+\Delta=W^{(0)}+P\Lambda Q$$
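A minimal sketch of this SVD-style parameterization: the update is factored as $P\Lambda Q$ with a diagonal $\Lambda$ whose entries can be zeroed out to reallocate the rank budget. The orthogonality regularizer on $P$ and $Q$ and the importance-based pruning are only indicated in comments, and the rank value is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaLoRALinear(nn.Module):
    """W = W0 + P Λ Q: the update is kept in SVD form so its rank can be adapted
    by zeroing entries of the diagonal Λ (the singular values)."""
    def __init__(self, pretrained: nn.Linear, r=12):          # initial rank budget, illustrative
        super().__init__()
        self.base = pretrained
        for p in self.base.parameters():
            p.requires_grad_(False)                           # W0 stays frozen
        d_out, d_in = pretrained.weight.shape
        self.P = nn.Parameter(torch.randn(d_out, r) * 0.01)   # left singular vectors
        self.Q = nn.Parameter(torch.randn(r, d_in) * 0.01)    # right singular vectors
        self.Lam = nn.Parameter(torch.zeros(r))                # diagonal of Λ
        # AdaLoRA additionally penalizes ||PᵀP - I|| and ||QQᵀ - I|| to keep P, Q
        # near-orthogonal, and prunes entries of Λ according to an importance score.

    def forward(self, x):
        delta_w = self.P @ torch.diag(self.Lam) @ self.Q       # P Λ Q
        return self.base(x) + F.linear(x, delta_w)
```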
Details:
Results: (one of several)
Current PEFT methods fall into four main categories: