Summary of Optimizers
Algorithm of Adam:
Uncentered variance (the second raw moment) measures the spread of a set of values around zero rather than around their mean. In Adam, the exponential moving average of the squared gradients is exactly such an estimate: it approximates the uncentered variance of the gradient.
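For reference, the uncentered second moment relates to the ordinary, centered variance by

\mathbb{E}[g^2] = \operatorname{Var}(g) + \left(\mathbb{E}[g]\right)^2

so the second-moment average tracks both the variance and the squared mean of the gradient.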
Note: bias_correction is not implemented below, to keep the code easy to read. With bias correction, the moment estimates are rescaled as follows:

# bias_correction = 1 - beta^t
m_t_hat = m_t / (1 - beta1^t)
v_t_hat = v_t / (1 - beta2^t)
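The correction compensates for the zero initialization of the moments, which biases early estimates toward zero. At the first step, for example,

m_1 = (1 - \beta_1)\, g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1} = g_1

so the corrected estimate recovers the gradient instead of a value shrunk toward zero.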
For more detail, see the function _single_tensor_adam in the PyTorch docs.
Equation:
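Written out in the same notation as the pseudocode below (bias correction omitted, \lambda denoting weight_decay):

g_t \leftarrow g_t + \lambda\, \theta_{t-1}
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\theta_t = \theta_{t-1} - \mathrm{lr} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}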
Implementation code:
if weight_decay:
    gradient += weight_decay * param                 # L2 regularization: decay folded into the gradient
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient       # first moment (EMA of gradients)
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2     # second moment (EMA of squared gradients)
param = param - lr * m_t / (sqrt(v_t) + epsilon)     # parameter update
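A minimal runnable sketch of one such step in NumPy (the function name adam_step and the explicit state dict are illustrative, not part of any library API):

import numpy as np

def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    # Adam with L2-style weight decay; bias correction omitted as above
    if weight_decay:
        grad = grad + weight_decay * param                    # fold decay into the gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad      # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2   # second moment
    return param - lr * state["m"] / (np.sqrt(state["v"]) + eps)

param = np.array([1.0, -2.0])
state = {"m": np.zeros_like(param), "v": np.zeros_like(param)}
param = adam_step(param, np.array([0.1, -0.3]), state, weight_decay=1e-2)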
Equation:
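In the same notation, AdamW decouples the decay term from the adaptive step (again without bias correction, \lambda = weight_decay):

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\theta_t = \theta_{t-1} - \mathrm{lr} \left( \frac{m_t}{\sqrt{v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)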
Implementation code:
# AdamW as written in the original paper: weight decay is decoupled from the gradient
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2
param = param - lr * (m_t / (sqrt(v_t) + epsilon) + weight_decay * param)

# AdamW as implemented in PyTorch: the parameters are decayed first, then updated
param *= (1 - lr * weight_decay)
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2
param = param - lr * m_t / (sqrt(v_t) + epsilon)
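The two formulations above produce identical updates for the same gradient and moment state; a small NumPy check (function names are illustrative):

import numpy as np

def adamw_paper(param, grad, m, v, lr, beta1, beta2, eps, wd):
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    param = param - lr * (m / (np.sqrt(v) + eps) + wd * param)
    return param, m, v

def adamw_pytorch_style(param, grad, m, v, lr, beta1, beta2, eps, wd):
    param = param * (1 - lr * wd)            # decay the parameters first
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    param = param - lr * m / (np.sqrt(v) + eps)
    return param, m, v

p, g = np.array([1.0, -2.0]), np.array([0.1, -0.3])
m, v = np.zeros_like(p), np.zeros_like(p)
args = (1e-3, 0.9, 0.999, 1e-8, 1e-2)
assert np.allclose(adamw_paper(p, g, m, v, *args)[0],
                   adamw_pytorch_style(p, g, m, v, *args)[0])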
The difference of weight_decay between AdamW and Adam
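In PyTorch, both optimizers expose the same weight_decay argument, but they apply it in the two different ways shown above; a minimal usage sketch:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# weight_decay acts as L2 regularization (added to the gradient)
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)
# weight_decay is decoupled and applied directly to the parameters
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)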