The AdamW algorithm is a variant of the Adam optimizer that is widely used in deep learning. Its main improvement is a change in how regularization is applied: it controls the magnitude of the model parameters through decoupled weight decay (applied directly to the weights) rather than L2 regularization, which improves training stability and results.
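To make the difference concrete, here is a side-by-side sketch of where the decay term enters each method (notation is defined with the formulas below). With L2 regularization the decay is added to the gradient, so it passes through Adam's adaptive rescaling; AdamW instead subtracts it from the weights directly:

\text{Adam + L2:} \quad g_t = \nabla f(\theta_{t-1}) + \lambda \theta_{t-1}, \qquad \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

\text{AdamW:} \quad g_t = \nabla f(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_{t-1}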
AdamW's update rule is similar to Adam's, but it introduces an explicit weight-decay term. The core formulas are as follows:
First and second moment estimates (exponential moving averages of the gradient and its square):
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
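As a quick, self-contained illustration of what the first moment does, the sketch below feeds a synthetic noisy gradient stream (plain Python scalars standing in for tensors, values of my own choosing) into the moving average. Note that because m starts at zero, the average is biased low in early steps, which is exactly what the correction in the next step compensates for:

import random

# Minimal sketch: the first-moment EMA smooths a noisy gradient stream.
# Scalars stand in for tensors; beta1 = 0.9 is the common default.
random.seed(0)
beta1 = 0.9
m = 0.0
for t in range(1, 11):
    g = 1.0 + random.gauss(0.0, 0.5)  # noisy gradient centered at 1.0
    m = beta1 * m + (1 - beta1) * g
    print(f"t={t:2d}  g={g:+.3f}  m={m:.3f}")
# m rises smoothly toward ~1.0, but starts near zero because m_0 = 0.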
Bias correction:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
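The effect of the correction is easy to check numerically: with a constant gradient of 1.0, the raw average satisfies m_t = 1 - \beta_1^t, so dividing by (1 - \beta_1^t) recovers the true mean exactly. A minimal sketch:

# Minimal sketch of bias correction with a constant gradient of 1.0.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0
    m_hat = m / (1 - beta1**t)  # bias-corrected estimate
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}")
# m: 0.1000, 0.1900, 0.2710, ... while m_hat is exactly 1.0 at every step.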
Parameter update:
\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_{t-1}
where:
- \theta_t: the model parameters at step t
- g_t: the gradient of the loss with respect to the parameters at step t
- m_t, v_t: exponential moving averages of the gradient and the squared gradient (first and second moment estimates)
- \hat{m}_t, \hat{v}_t: the bias-corrected moment estimates
- \beta_1, \beta_2: decay rates of the moment estimates (commonly 0.9 and 0.999)
- \eta: the learning rate
- \lambda: the weight decay coefficient
- \epsilon: a small constant for numerical stability (commonly 10^{-8})
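Putting the three steps together, here is a from-scratch sketch of a single AdamW update in PyTorch. The function adamw_step and its signature are my own for illustration, not part of any library API; the hyperparameter defaults mirror the common ones:

import torch

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # One AdamW update following the formulas above (illustrative sketch;
    # adamw_step is an ad hoc name, not a PyTorch API).
    m = beta1 * m + (1 - beta1) * g           # first moment estimate
    v = beta2 * v + (1 - beta2) * g**2        # second moment estimate
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps) - lr * weight_decay * theta
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta, m, v = torch.ones(3), torch.zeros(3), torch.zeros(3)
for t in range(1, 201):
    theta, m, v = adamw_step(theta, theta.clone(), m, v, t, lr=0.05)
print(theta)  # entries hover near zero after 200 steps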
The following complete PyTorch example trains a small binary classifier with torch.optim.AdamW:

import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Define the dataset and data loader
data = torch.randn(1000, 10)           # 1000 samples with 10 features each
labels = torch.randint(0, 2, (1000,))  # binary classification labels
dataset = TensorDataset(data, labels)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Define the model and loss
model = torch.nn.Linear(10, 2)
criterion = torch.nn.CrossEntropyLoss()

# Create the AdamW optimizer
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    for batch_data, batch_labels in data_loader:
        optimizer.zero_grad()
        outputs = model(batch_data)
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
    # Print the loss of the last batch in each epoch
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}')
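One common refinement that the example above does not show: biases (and, in larger models, normalization parameters) are often excluded from weight decay. optim.AdamW accepts per-parameter-group options, so for this toy model that could look like:

# Split parameters into decayed and non-decayed groups; the name-based
# filter is a simple heuristic that suffices for this toy model.
decay = [p for n, p in model.named_parameters() if not n.endswith("bias")]
no_decay = [p for n, p in model.named_parameters() if n.endswith("bias")]

optimizer = optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.001,
)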