~Terms related to optimization algorithms~
Gradient Descent
Gradient Descent is an algorithm for optimizing the parameters of a machine learning model. Based on the gradient (slope) of the loss function, it adjusts the parameters in the direction that reduces the error; specifically, the parameters are updated in the direction opposite to the gradient at the current position. Gradient descent repeats this update until the loss function approaches its minimum, yielding a (near-)optimal solution. The update step is expressed by the following formula:
[ \theta = \theta - \eta \cdot \nabla J(\theta) ]
where ( \theta ) is the parameter vector, ( \eta ) is the learning rate, and ( \nabla J(\theta) ) is the gradient of the loss function with respect to the parameters. The basic form of gradient descent is “batch gradient descent”, which uses all the training data at once to calculate the gradient of the loss function, but this is computationally expensive and inefficient, especially for large datasets. To address this, stochastic gradient descent and mini-batch gradient descent were developed.
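As a rough illustration, the following is a minimal NumPy sketch of batch gradient descent on a synthetic least-squares problem; the data, learning rate, and iteration count are arbitrary example values, not recommendations.

```python
import numpy as np

# Synthetic linear-regression data for the example (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)   # parameters to optimize
eta = 0.1             # learning rate
for _ in range(200):
    grad = X.T @ (X @ theta - y) / len(y)  # gradient of the mean squared error over all data
    theta = theta - eta * grad             # step in the direction opposite to the gradient
print(theta)  # should be close to true_theta
```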
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a type of gradient descent method that, instead of using all the training data at once, randomly selects one sample (or a small number of samples) at each step to calculate the gradient of the loss function and update the parameters. This greatly speeds up calculations, making learning particularly efficient for large datasets. The SGD update formula is as follows:
[ \theta = \theta - \eta \cdot \nabla J(\theta; x^{(i)}, y^{(i)}) ]
where ( (x^{(i)}, y^{(i)}) ) is a single sample from the dataset. The advantage of SGD is that it is fast and can update in (near) real time; the disadvantage is that the parameter updates are strongly affected by noise, so the loss can struggle to converge stably. To mitigate this, techniques such as momentum and the Nesterov accelerated gradient method are often used.
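A minimal sketch of SGD on the same kind of synthetic least-squares problem is shown below, assuming one randomly chosen sample per update; all data and hyperparameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.01
for epoch in range(20):
    for i in rng.permutation(len(y)):        # visit samples in a random order each epoch
        grad = (X[i] @ theta - y[i]) * X[i]  # gradient of the loss on a single sample
        theta = theta - eta * grad
print(theta)  # approximately recovers the true coefficients, with some noise
```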
Mini-Batch Gradient Descent
Mini-Batch Gradient Descent is a method that falls somewhere between Batch Gradient Descent and Stochastic Gradient Descent (SGD). It divides the training data into small “mini-batches”, calculates the gradient for each mini-batch, and updates the parameters. The mini-batch size (usually around 32 to 256 samples) is set as a hyperparameter. The update formula for mini-batch gradient descent is as follows:
[ \theta = \theta - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla J(\theta; x^{(i)}, y^{(i)}) ]
where ( m ) is the size of the mini-batch. Mini-batch gradient descent has the advantage of being less noisy and more computationally efficient than SGD. It also parallelizes well on GPUs and in distributed environments, which improves training speed. Mini-batch gradient descent is the standard approach in most modern deep learning frameworks and enables efficient training.
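The sketch below shows the same idea with mini-batches; the dataset, batch size of 32, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
eta, batch_size = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(y))                       # shuffle once per epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]               # indices of one mini-batch
        grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)  # gradient averaged over the mini-batch
        theta = theta - eta * grad
print(theta)
```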
Momentum
Momentum is a technique that accelerates the convergence of gradient descent and suppresses oscillations during training. Momentum gives the current update “inertia” based on past gradients. This speeds up progress when the gradient consistently points in the same direction, and reduces both zigzag movements and the risk of getting caught in local minima. The momentum update formula is as follows:
[ v_t = \beta v_{t-1} + (1 - \beta) \nabla J(\theta) ]
[ \theta = \theta - \eta v_t ]
where ( \beta ) is the momentum coefficient (usually around 0.9), which controls the influence of past gradients. Momentum accelerates movement in regions where the gradient is small and maintains stability where the gradient fluctuates greatly, so it makes SGD converge faster and enables more stable learning.
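Below is a minimal sketch of the momentum update on a toy quadratic loss, following the ( (1 - \beta) )-weighted formula above; the curvature matrix, learning rate, and momentum coefficient are arbitrary example values.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # ill-conditioned toy quadratic: J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta      # gradient of the toy loss

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
eta, beta = 0.05, 0.9
for _ in range(200):
    v = beta * v + (1 - beta) * grad(theta)  # accumulate "inertia" from past gradients
    theta = theta - eta * v
print(theta)  # approaches the minimum at the origin
```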
Nesterov Accelerated Gradient (NAG)
Nesterov Accelerated Gradient (NAG) is an optimization method that improves on momentum by adding a “look-ahead” step, aiming to approach the minimum of the loss function more accurately. NAG is characterized by calculating the gradient at the position predicted by the momentum, rather than at the current position. This makes parameter updates more accurate and prevents over-correction. The NAG update formula is as follows:
[ v_t = \beta v_{t-1} + (1 - \beta) \nabla J(\theta - \eta \beta v_{t-1}) ]
[ \theta = \theta - \eta v_t ]
NAG stabilizes learning, especially on complex loss function landscapes, by computing gradients more precisely while maintaining the momentum’s acceleration effect. In the field of deep learning, methods that add momentum to SGD are standard, but NAG provides even more efficient convergence.
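The sketch below implements the NAG update from the formulas above on the same kind of toy quadratic; the gradient is evaluated at the look-ahead position rather than the current one. All values are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
eta, beta = 0.05, 0.9
for _ in range(200):
    lookahead = theta - eta * beta * v            # position predicted by the momentum
    v = beta * v + (1 - beta) * grad(lookahead)   # gradient at the look-ahead point
    theta = theta - eta * v
print(theta)  # approaches the minimum at the origin
```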
AdaGrad
AdaGrad (Adaptive Gradient Algorithm) is an extension of the gradient descent method, an algorithm that improves optimization efficiency by automatically adjusting the learning rate for each parameter. In the conventional gradient descent method, the same learning rate is applied to all parameters, but in AdaGrad, the learning rate of parameters that are frequently updated is reduced, and the learning rate of parameters that are not updated very often is increased, to improve the stability and efficiency of learning. The update formula is as follows:
[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla J(\theta_t) ]
where ( G_t ) is the accumulated sum of squared past gradients, and ( \epsilon ) is a small value to prevent division by zero. AdaGrad is effective for sparse datasets (e.g. natural language processing and recommendation systems), and also adapts to problems whose features have different scales. However, if the accumulated sum of squared gradients becomes too large, the learning rate becomes extremely small and learning effectively stops.
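A minimal AdaGrad sketch on a toy quadratic loss is shown below; the learning rate and ( \epsilon ) are example values only.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)  # per-parameter accumulated sum of squared gradients
eta, eps = 0.5, 1e-8
for _ in range(500):
    g = grad(theta)
    G = G + g ** 2                              # squared gradients only ever accumulate
    theta = theta - eta / np.sqrt(G + eps) * g  # per-parameter effective learning rate
print(theta)  # steps shrink as G grows, so progress slows over time
```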
RMSProp
RMSProp (Root Mean Square Propagation) is an optimization algorithm that improves on this shortcoming of AdaGrad: it controls the accumulation of squared gradients with an exponential moving average so that the learning rate does not shrink toward zero. Old gradients are gradually discounted, which alleviates the problem of learning coming to a halt. The update formula for RMSProp is as follows:
[ E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2 ]
[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla J(\theta_t) ]
Here, ( E[g^2]_t ) is the moving average of the squared gradient, ( \beta ) is the decay rate (usually 0.9), and ( \epsilon ) is a small value to maintain numerical stability. RMSProp performs particularly well in tasks where the gradient is likely to fluctuate, such as online learning and recurrent neural networks (RNNs). It is also suitable for long-term training because it handles the gradient history flexibly and adjusts the learning rate adaptively.
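The following sketch implements the RMSProp update above on a toy quadratic; ( \eta ), ( \beta ), and ( \epsilon ) are illustrative values.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)  # exponential moving average of squared gradients
eta, beta, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2        # old gradients decay, unlike AdaGrad
    theta = theta - eta / np.sqrt(Eg2 + eps) * g
print(theta)  # moves toward the minimum at the origin
```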
Adam Optimizer
Adam (Adaptive Moment Estimation) is an evolution of gradient descent, an optimization algorithm that combines RMSProp and momentum. Adam updates parameters while considering both the first-order moment (moving average) and second-order moment (moving average of the square of the gradient) of the gradient. This allows the learning rate to be adjusted adaptively, leading to fast and stable learning. The update equations are as follows:
[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ]
[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ]
[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ]
[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t ]
where ( m_t ) and ( v_t ) are the moving averages of the gradient and the squared gradient, respectively, ( \beta_1 ) and ( \beta_2 ) are the moment coefficients (usually 0.9 and 0.999), and ( \epsilon ) is a small constant used to maintain numerical stability. Adam is a very widely used optimization method because of its fast convergence and learning stability. It is particularly effective for complex models in the field of deep learning.
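A minimal Adam sketch on a toy quadratic loss follows, using the update equations above with bias correction; the hyperparameters are the commonly quoted defaults, and the toy loss itself is an illustrative assumption.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)  # first moment (moving average of gradients)
v = np.zeros_like(theta)  # second moment (moving average of squared gradients)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # approaches the minimum at the origin
```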
AdaDelta
AdaDelta is an optimization algorithm that limits the cumulative effect of gradients in AdaGrad, and improves the problem of excessively decreasing learning rates. AdaDelta uses a moving average of gradients to update parameters, so that it does not rely too much on gradient history. This prevents the learning rate from becoming extremely small, as in AdaGrad. The update formula for AdaDelta is as follows:
[ E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 ]
[ \Delta \theta_t = - \frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t ]
[ E[\Delta \theta^2]_t = \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) (\Delta \theta_t)^2 ]
where ( \rho ) is the decay rate, ( E[g^2]_t ) is the moving average of the squared gradient, and ( E[\Delta \theta^2]_t ) is the moving average of the squared parameter updates. AdaDelta eliminates the need to set the learning rate by hand and dynamically adjusts the effective step size for every parameter, so it operates stably even with large datasets and complex models.
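The sketch below follows the AdaDelta equations above on a toy quadratic; note that no explicit learning rate appears. ( \rho ), ( \epsilon ), and the iteration count are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)   # moving average of squared gradients
Edx2 = np.zeros_like(theta)  # moving average of squared parameter updates
rho, eps = 0.95, 1e-6
for _ in range(2000):
    g = grad(theta)
    Eg2 = rho * Eg2 + (1 - rho) * g ** 2
    dtheta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g  # step size set by past updates
    Edx2 = rho * Edx2 + (1 - rho) * dtheta ** 2
    theta = theta + dtheta
print(theta)  # moves toward the minimum without a hand-set learning rate
```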
Adamax
Adamax is an extension of the Adam optimizer that uses the L∞ norm. It is particularly effective when the scale of the parameters is large, and in some cases it is more numerically stable than Adam. The update formula for Adamax is basically the same as Adam's, except that the second moment is replaced by an exponentially weighted L∞ norm of the gradients. Specifically, the update is expressed as follows:
[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ]
[ u_t = \max(\beta_2 u_{t-1}, |g_t|) ]
[ \theta_{t+1} = \theta_t - \frac{\eta}{u_t} m_t ]
Here, ( u_t ) is an exponentially weighted L∞ norm of the past gradients (a decayed running maximum of their absolute values). Adamax inherits the characteristics of Adam and works well in tasks where scaling and numerical stability are important.
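Below is a minimal Adamax sketch following the simplified formulas above (bias correction of ( m_t ) is omitted, as in the equations); the toy loss and hyperparameters are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
u = np.zeros_like(theta)  # exponentially weighted infinity norm of past gradients
eta, beta1, beta2 = 0.05, 0.9, 0.999
for _ in range(500):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    u = np.maximum(beta2 * u, np.abs(g))  # decayed running maximum instead of a squared average
    theta = theta - eta / u * m           # assumes a nonzero gradient so u stays positive
print(theta)  # approaches the minimum at the origin
```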
Nadam
Nadam (Nesterov-accelerated Adaptive Moment Estimation) is an extended version of the Adam optimizer that incorporates the idea of the Nesterov Accelerated Gradient (NAG) method. While Adam combines momentum with moving averages of the gradient to adjust the learning rate adaptively, Nadam applies Nesterov's look-ahead to the momentum term, effectively folding the upcoming momentum step into the current gradient calculation. This is expected to give faster and more stable convergence than Adam. The Nadam update formula is as follows:
[ m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t ]
[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ]
[ \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} ]
[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \left( \beta_1 \hat{m}_t + (1 - \beta_1) g_t \right) ]
Nadam is particularly effective for learning sequential data such as RNNs and LSTMs, and in some cases it is seen to have an advantage over Adam in terms of convergence speed and stability.
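The sketch below follows the simplified Nadam update given above on a toy quadratic loss; the hyperparameters and the loss are illustrative assumptions.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    nesterov_m = beta1 * m_hat + (1 - beta1) * g  # fold the current gradient into the momentum
    theta = theta - eta * nesterov_m / (np.sqrt(v_hat) + eps)
print(theta)  # approaches the minimum at the origin
```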
AMSGrad
AMSGrad is a modified version of the Adam optimizer, developed to address cases where Adam can fail to converge because its effective learning rate is allowed to grow. Adam updates parameters using a moving average of the squared gradient, whereas AMSGrad keeps the running maximum of that moving average and uses it for the parameter update (the first moment ( m_t ) is computed as in Adam). This prevents the effective learning rate from increasing, and stable convergence is expected. The update formula for AMSGrad is as follows:
[ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 ]
[ \hat{v}_t = \max(\hat{v}_{t-1}, v_t) ]
[ \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} m_t ]
AMSGrad is an optimization method suitable for long-term learning and large datasets, and in particular, it is less likely to get stuck in local minima, providing better convergence.
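A minimal AMSGrad sketch following the formulas above is shown below, with the first moment ( m_t ) computed as in Adam; the toy loss and hyperparameters are illustrative.

```python
import numpy as np

A = np.diag([10.0, 1.0])  # toy quadratic loss J(theta) = 0.5 * theta @ A @ theta

def grad(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
v_hat = np.zeros_like(theta)  # running maximum of the second moment
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for _ in range(500):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    v_hat = np.maximum(v_hat, v)  # the denominator can only grow, so the step cannot blow up
    theta = theta - eta * m / (np.sqrt(v_hat) + eps)
print(theta)  # approaches the minimum at the origin
```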
Learning Rate Decay
Learning Rate Decay is a method of gradually decreasing the learning rate during training of a neural network. In the initial phase, large steps are taken with a high learning rate; as training progresses, the learning rate is reduced to make finer adjustments, which stabilizes convergence to the optimal solution. Typical decay schedules include the following approaches:
- Step decay: The learning rate is decreased at regular intervals.
[ \eta_t = \eta_0 \cdot \gamma^{\lfloor \frac{t}{T} \rfloor} ]
where ( \eta_0 ) is the initial learning rate, ( \gamma ) is the decay rate, and ( T ) is the decay interval.
- Exponential decay: The learning rate decreases exponentially over time.
[ \eta_t = \eta_0 e^{-\lambda t} ]
- Cosine decay: The learning rate decreases along a cosine curve.
By using learning rate decay appropriately, it is expected that the model will converge to the optimal solution efficiently. This is particularly effective for tasks that require fine-tuning in the latter half of training.
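The three schedules above can be sketched as simple functions of the epoch ( t ); ( \eta_0 ), ( \gamma ), ( T ), ( \lambda ), and the total epoch count are illustrative choices.

```python
import math

eta0 = 0.1  # initial learning rate (illustrative)

def step_decay(t, gamma=0.5, T=10):
    return eta0 * gamma ** (t // T)          # drop by a factor gamma every T epochs

def exponential_decay(t, lam=0.05):
    return eta0 * math.exp(-lam * t)         # smooth exponential decrease

def cosine_decay(t, t_max=100):
    return eta0 * 0.5 * (1 + math.cos(math.pi * t / t_max))  # cosine curve down to zero

for t in (0, 10, 50, 100):
    print(t, step_decay(t), exponential_decay(t), cosine_decay(t))
```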
Early Stopping
Early stopping is a method of ending training before it completes in order to prevent overfitting. Normally, the model keeps improving its accuracy on the training data, but once overfitting sets in, performance on the validation and test data begins to decline. Early stopping therefore ends training when the validation loss stops improving, typically after it has failed to improve for a set number of epochs. Early stopping is implemented in the following flow:
- Monitor the validation loss during training.
- Terminate training if the loss does not improve for more than a set number of epochs (the “patience”).
Early stopping reduces training costs and halts learning at a near-optimal number of epochs, preventing the model from overfitting the training data and improving its generalization ability.
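As a rough sketch, the loop below applies the two steps above to a simulated validation-loss curve; the curve, the patience of 5 epochs, and the epoch budget are illustrative assumptions, not values from the text.

```python
import numpy as np

# Simulated validation loss: decreases at first, then slowly rises as overfitting sets in.
rng = np.random.default_rng(0)
val_losses = [1.0 / (epoch + 1) + 0.002 * epoch + 0.01 * rng.random()
              for epoch in range(100)]

patience = 5               # allowed number of epochs without improvement
best_loss = float("inf")
wait = 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch, wait = val_loss, epoch, 0  # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:
            print(f"stopping at epoch {epoch}; best validation loss was at epoch {best_epoch}")
            break
```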