Adam family

Algorithm of Adam:
$$
\begin{aligned}
g_{t} &\gets \nabla_{\theta} f_t(\theta_{t-1})\\
m_{t} &\gets \beta_{1} m_{t-1} + (1-\beta_{1}) g_t\\
v_{t} &\gets \beta_{2} v_{t-1} + (1-\beta_{2}) g_t^2\\
\widehat{m}_{t} &\gets \frac{m_{t}}{1-\beta_1^t},\quad \widehat{v}_{t} \gets \frac{v_{t}}{1-\beta_2^t}\\
\theta_{t} &\gets \theta_{t-1} - \eta\,\frac{\widehat{m}_{t}}{\sqrt{\widehat{v}_{t}} + \epsilon}
\end{aligned}
$$
$\widehat{m}_t$: the bias-corrected first moment estimate (an exponential moving average of the gradients).
$\widehat{v}_t$: the bias-corrected second moment estimate (an exponential moving average of the squared gradients); an estimate of the uncentered variance of the gradients.
Uncentered variance is the second raw moment: it describes the spread of a set of values around zero rather than around their mean, i.e., $\mathbb{E}[g^2]$ instead of $\mathbb{E}[(g-\mathbb{E}[g])^2]$. In Adam, the exponential moving average of the squared gradients serves as a running estimate of this quantity.
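As a quick numerical sketch (toy values chosen here purely for illustration), the uncentered second moment that $v_t$ tracks differs from the usual centered variance whenever the mean gradient is nonzero:

```python
import numpy as np

g = np.array([0.5, -0.2, 0.8, 0.1])      # a toy set of gradient values

uncentered = np.mean(g ** 2)             # E[g^2], the quantity v_t estimates
centered = np.mean((g - g.mean()) ** 2)  # Var[g] = E[(g - E[g])^2]

print(uncentered)  # 0.235
print(centered)    # 0.145 = 0.235 - 0.3**2
```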
Note: the implementation snippets below omit bias correction for readability. With bias correction, the update uses

m_t_hat = m_t / (1 - beta1 ** t)
v_t_hat = v_t / (1 - beta2 ** t)

See the _single_tensor_adam function in PyTorch's source for the full details.
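As a small illustration (the step values below are hypothetical, chosen only for the example), the correction factors 1 / (1 - beta ** t) are large during the first few steps, which compensates for the zero-initialized moments, and decay toward 1 as training proceeds:

```python
beta1, beta2 = 0.9, 0.999   # typical Adam defaults

for t in (1, 10, 100, 1000):
    # bias-correction factors applied to m_t and v_t at step t
    c1 = 1.0 / (1.0 - beta1 ** t)
    c2 = 1.0 / (1.0 - beta2 ** t)
    print(f"t={t:5d}  m-correction={c1:9.3f}  v-correction={c2:9.3f}")
```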
Adam

Equation:
$$
g_{t} \gets \nabla_{\theta} f_t(\theta_{t-1}) + \lambda\,\theta_{t-1}, \quad \text{(with L2 regularization)}
$$

Implementation code:
if weight_decay != 0:
    gradient = gradient + weight_decay * param       # fold L2 regularization into the gradient
m_t = beta1 * m_t + (1 - beta1) * gradient           # first moment: EMA of gradients
v_t = beta2 * v_t + (1 - beta2) * gradient ** 2      # second moment: EMA of squared gradients
param = param - lr * m_t / (v_t ** 0.5 + epsilon)    # update step (bias correction omitted)

AdamW

Equation:
$$
\theta_{t} \gets \theta_{t-1} - \eta\left(\frac{\widehat{m}_{t}}{\sqrt{\widehat{v}_{t}} + \epsilon} + \lambda\,\theta_{t-1}\right)
$$

Implementation code:
m_t = beta1 * m_t + (1 - beta1) * gradient
v_t = beta2 * v_t + (1 - beta2) * gradient ** 2
param = param - lr * (m_t / (v_t ** 0.5 + epsilon) + weight_decay * param)   # decay applied to the parameters, not the gradient
or, equivalently, with the weight decay applied as a separate step:

param *= (1 - lr * weight_decay)                     # decoupled weight decay
m_t = beta1 * m_t + (1 - beta1) * gradient
v_t = beta2 * v_t + (1 - beta2) * gradient ** 2
param = param - lr * m_t / (v_t ** 0.5 + epsilon)    # standard Adam update (bias correction omitted)

Difference of weight_decay between AdamW and Adam
In Adam, the weight decay term is added to the gradient (L2 regularization), so it flows through the moment estimates m_t and v_t and is rescaled by the adaptive denominator.
In AdamW, the weight decay is applied directly to the model parameters (decoupled weight decay), so the moment estimates are unaffected by it (see the sketch below).
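To make the difference concrete, here is a minimal NumPy sketch of one step of each rule (illustrative code, not the PyTorch implementation; the function names and default hyperparameters are chosen for this example). With the Adam-style L2 term the decay is rescaled by the adaptive denominator, while AdamW shrinks every parameter at the same rate lr * weight_decay:

```python
import numpy as np

def adam_l2_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, weight_decay=1e-2):
    """One Adam step with L2 regularization folded into the gradient."""
    grad = grad + weight_decay * param                 # decay passes through m_t and v_t
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                       # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step with decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad                 # moments see only the raw gradient
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

# Same starting point, same gradient -- the two rules produce different parameters.
p0, g = np.array([1.0]), np.array([0.1])
m0, v0 = np.zeros(1), np.zeros(1)
print(adam_l2_step(p0, g, m0, v0, t=1)[0])
print(adamw_step(p0, g, m0, v0, t=1)[0])
```

Running the two functions from the same starting point with the same gradient gives slightly different results, which is exactly the decoupling of weight decay from the adaptive update that AdamW introduces.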