Gradient Descent is an optimization algorithm used to minimize a function, typically a loss function, by iteratively adjusting the parameters of a model. It aims to find the optimal values for the parameters that minimize the difference between the predicted outputs of the model and the actual targets.

Here’s a high-level overview of how Gradient Descent works:

  1. Initialize Parameters: The algorithm starts by initializing the model’s parameters, such as weights and biases, with random values or predefined techniques.
  2. Calculate the Gradient: The gradients of the loss function with respect to the parameters are computed, using techniques like backpropagation in neural networks. The gradients represent the direction and magnitude of the steepest ascent of the function.
  3. Update Parameters: The parameters are updated iteratively by taking steps in the opposite direction of the gradients. This is done to descend along the gradient and move closer to the optimal values.
  4. Learning Rate: The step size taken in each iteration is determined by a hyperparameter called the learning rate. It controls the size of the parameter update and affects the convergence speed and stability of the optimization process.
  5. Repeat Until Convergence: Steps 2 to 4 are repeated until a stopping criterion is met. This criterion can be a maximum number of iterations, reaching a predefined threshold for the loss function, or observing a minimal change in the parameters.

By iteratively updating the parameters based on the gradients, Gradient Descent tries to find the parameter values that minimize the loss function. The process continues until the algorithm converges to a satisfactory solution or reaches the stopping criterion.
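
To make these steps concrete, here is a minimal Python sketch of gradient descent fitting a one-parameter linear model under a mean-squared-error loss; the toy data, learning rate, and iteration count are illustrative choices, not part of the algorithm itself.

```python
import numpy as np

# Toy data: y = 2x plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)

w = 0.0              # step 1: initialize the parameter
learning_rate = 0.1  # step 4: the step-size hyperparameter

for step in range(200):                  # step 5: repeat until convergence
    grad = np.mean(2 * (w * X - y) * X)  # step 2: gradient of MSE w.r.t. w
    w -= learning_rate * grad            # step 3: step against the gradient

print(w)  # ends up close to the true slope of 2
```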

There are different variations of Gradient Descent, including:

  • Batch Gradient Descent: Computes gradients using the entire training dataset in each iteration.
  • Stochastic Gradient Descent (SGD): Computes gradients using a single randomly selected training example in each iteration.
  • Mini-Batch Gradient Descent: Computes gradients using a small randomly selected batch of training examples in each iteration, striking a balance between batch GD and SGD.

Gradient Descent is a fundamental optimization algorithm in machine learning and is widely used in various models, such as linear regression, logistic regression, and neural networks. It provides a way to iteratively update the model’s parameters to find the values that minimize the loss function and improve the model’s performance.

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is an optimization algorithm commonly used in machine learning for training models. It is an extension of the standard Gradient Descent algorithm that offers several advantages, particularly in large-scale datasets.

Here’s an overview of how Stochastic Gradient Descent works:

  1. Dataset and Mini-Batches:
    • SGD operates on a dataset divided into smaller subsets called mini-batches (in deep-learning practice, “SGD” almost always refers to this mini-batch variant).
    • Each mini-batch consists of a few training examples randomly sampled from the entire dataset.
  2. Iterative Updates:
    • The model’s parameters (weights and biases) are initialized randomly or using predefined techniques.
    • SGD iteratively updates the model’s parameters based on the gradients calculated from the mini-batches.
  3. Gradient Calculation:
    • For each mini-batch, the gradients of the loss function with respect to the model’s parameters are computed.
    • In neural networks, these gradients are obtained by backpropagation, which propagates the errors backward through the network.
  4. Parameter Update:
    • After computing the gradients, the model’s parameters are updated using the update rule: θ = θ – α * ∇J(θ), where θ represents the parameters, α denotes the learning rate, and ∇J(θ) represents the gradients.
  5. Repeat Until Convergence:
    • The process of calculating gradients and updating parameters is repeated for multiple iterations or until a stopping criterion is met.
    • The iterations continue until the model converges to a satisfactory solution or a maximum number of iterations is reached.
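
A minimal NumPy sketch of this loop on synthetic linear-regression data; the batch size, learning rate, and epoch count are illustrative, and the update line implements θ = θ – α * ∇J(θ) on each mini-batch.

```python
import numpy as np

# Synthetic regression data: 1,000 examples with 3 features
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)   # step 2: initialized parameters
alpha = 0.05      # learning rate
batch_size = 32   # step 1: mini-batch size

for epoch in range(20):
    order = rng.permutation(len(X))  # random sampling of mini-batches
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # step 3: mini-batch gradient
        w -= alpha * grad                             # step 4: theta = theta - alpha * grad

print(w)  # approaches true_w
```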

Key Characteristics of Stochastic Gradient Descent:

  1. Efficiency with Large Datasets: SGD is computationally efficient for large datasets as it only requires a small mini-batch to calculate the gradients at each iteration. This makes SGD well-suited for deep learning and large-scale machine learning tasks.
  2. Faster Convergence: The frequent parameter updates based on mini-batches often let SGD make progress faster than standard Gradient Descent. Because each update uses only a small subset of the data, many cheap updates can be performed in the time a single full-batch update would take.
  3. Stochastic Nature: As the name suggests, SGD introduces randomness due to the random selection of mini-batches. This stochasticity can sometimes help escape local minima and explore a larger space of solutions.
  4. Noisy Gradient Estimates: Since each mini-batch is a random sample from the dataset, the gradients computed from it are noisy estimates of the true gradients. This noise introduces variance in the parameter updates, but it can also act as a form of implicit regularization that helps prevent overfitting.
  5. Learning Rate Selection: The learning rate (α) is an important hyperparameter in SGD. It determines the step size for parameter updates and affects the convergence speed and stability of the optimization process. Selecting an appropriate learning rate is crucial to ensure convergence and prevent overshooting or slow convergence.

SGD is a widely used optimization algorithm, particularly in deep learning, due to its efficiency and ability to handle large datasets. However, it may require careful tuning of hyperparameters and monitoring to achieve optimal performance and convergence. Various extensions, such as learning rate schedules and momentum, have been developed to enhance SGD’s performance and stability in practice.

Momentum

In the context of optimization algorithms in AI, momentum is a technique used to improve the convergence speed and stability of gradient-based optimization methods such as Gradient Descent. It introduces a notion of inertia or momentum to the parameter updates, allowing the optimization process to “carry over” information from previous updates.

Here’s a brief explanation of how momentum works:

  1. Standard Gradient Descent Update:
    • In standard Gradient Descent, the update of model parameters at each iteration is based solely on the current gradient of the loss function with respect to the parameters.
    • The update rule can be expressed as: θ(t+1) = θ(t) – α * ∇J(θ(t)), where θ(t) represents the parameters at iteration t, α denotes the learning rate, and ∇J(θ(t)) represents the gradient.
  2. Introduction of Momentum:
    • Momentum introduces a new term that accounts for the accumulated momentum from previous updates.
    • At each iteration, the previous velocity is multiplied by a momentum coefficient (denoted β) and the current gradient step is added to it.
    • The update rule with momentum can be expressed as: v(t+1) = β * v(t) + α * ∇J(θ(t)), θ(t+1) = θ(t) – v(t+1).
    • Here, v(t) represents the velocity or momentum at iteration t, and β is a hyperparameter known as the momentum coefficient.
  3. Benefits of Momentum:
    • Faster Convergence: Momentum allows the optimization algorithm to accumulate velocity in directions that consistently decrease the loss function, resulting in faster convergence.
    • Smoother Optimization: By incorporating information from previous updates, momentum helps smooth out the oscillations that can occur during optimization, leading to more stable and consistent updates.
    • Escape from Local Minima: Momentum can help the optimization process overcome shallow local minima or plateaus by carrying accumulated velocity through flat regions of the loss landscape.
  4. Tuning the Momentum Coefficient:
    • The value of the momentum coefficient (β) typically ranges between 0 and 1. A higher value increases the influence of previous updates, while a lower value reduces it.
    • Commonly used values for β range from 0.9 to 0.99, but it can vary depending on the specific problem and the characteristics of the optimization landscape.
    • The momentum coefficient is typically tuned alongside other hyperparameters, such as the learning rate, and its optimal value may vary for different models and datasets.

Momentum is a widely used technique in AI and deep learning because it can significantly improve the optimization process, especially in scenarios with complex loss landscapes and large datasets. By introducing inertia into the parameter updates, momentum helps accelerate convergence, smooth out optimization trajectories, and enhance the ability to escape local minima.
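
As a sketch, the two update equations from step 2 map directly onto code. The quadratic loss below is an illustrative stand-in: an ill-conditioned bowl where momentum builds up speed along the shallow direction that slows plain gradient descent down.

```python
import numpy as np

def grad_J(theta):
    # Gradient of an illustrative quadratic loss J(theta) = 0.5 * theta^T A theta
    A = np.array([[10.0, 0.0], [0.0, 1.0]])  # ill-conditioned curvature
    return A @ theta

theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)  # velocity, v(0) = 0
alpha, beta = 0.05, 0.9   # learning rate and momentum coefficient

for t in range(100):
    v = beta * v + alpha * grad_J(theta)  # v(t+1) = beta * v(t) + alpha * grad
    theta = theta - v                     # theta(t+1) = theta(t) - v(t+1)

print(theta)  # converges toward the minimum at the origin
```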

Learning Rate Schedules (Choosing the Optimal Learning Rate)

The learning rate (η) should be small enough that we descend gently, instead of oscillating or diverging, and large enough that we reach the minimum in a reasonable amount of time. Learning rate schedules can help balance these goals.

Here is a basic example of a learning rate schedule:

  1. We start from a high initial learning rate (first 5 epochs): η = 0.1
  2. At some point we lower the rate to avoid oscillations (next 5 epochs): η = 0.01
  3. Around the end we pick a very small rate to get a precise answer (until the end): η = 0.001
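
As a sketch, this step schedule is just a small piecewise function; the epoch boundaries and rates mirror the example above.

```python
def step_schedule(epoch):
    # Piecewise-constant learning rate matching the schedule above
    if epoch < 5:
        return 0.1    # high initial rate
    elif epoch < 10:
        return 0.01   # lowered to avoid oscillations
    else:
        return 0.001  # very small rate for a precise answer
```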

The Exponential Schedule

This is usually a better option as the learning rate is gradually reduced. We usually start at a high rate like η0 = 0.1. Then we update the learning rate at each epoch using the equation:

\[ \eta = \eta_0e^{-n/c} \]

Where:

  • n = the number of the current epoch.
  • c = a constant, usually between 20 and 30. This is another hyperparameter.
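
In code, the schedule is a one-liner; the values η0 = 0.1 and c = 25 below are illustrative picks from the ranges above.

```python
import math

def exponential_schedule(epoch, eta0=0.1, c=25.0):
    # eta = eta0 * e^(-n/c), evaluated once per epoch n
    return eta0 * math.exp(-epoch / c)

for epoch in (0, 10, 25, 50):
    print(epoch, exponential_schedule(epoch))  # 0.1, ~0.067, ~0.037, ~0.014
```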

AdaGrad

  • AdaGrad = Adaptive Gradient Algorithm
  • AdaGrad dynamically varies the learning rate at each update and for each weight individually.
  • Here is the equation for the change in weight after each update:
\[ \Delta w_i(t) = -\frac{\eta}{\sqrt{G_i(t)} + \epsilon} \frac{\partial L}{\partial w_i}(t) \]

where:

\[ G_i(t) = G_i(t-1) + (\frac{\partial L}{\partial w_i}(t))^2 \]

with the initial value Gi(0) = 0.

  • wi = the weight with index i. The update rule is applied individually to each weight.
  • t = iteration (time) at which we are updating.
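
A minimal NumPy sketch of the per-weight AdaGrad update above; the toy loss and the values of η and ε are illustrative.

```python
import numpy as np

def adagrad_update(w, grad, G, eta=0.1, eps=1e-8):
    G += grad ** 2                        # G_i(t) = G_i(t-1) + (dL/dw_i)^2
    w -= eta / (np.sqrt(G) + eps) * grad  # delta_w_i = -eta / (sqrt(G_i) + eps) * dL/dw_i
    return w, G

w = np.array([1.0, -2.0])
G = np.zeros_like(w)  # starting point G_i(0) = 0
for t in range(200):
    grad = 2 * w      # gradient of the toy loss L(w) = ||w||^2
    w, G = adagrad_update(w, grad, G)
print(w)  # slowly approaches 0 as the growing G shrinks the effective step size
```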

RMSprop

  • RMSprop = Root Mean Square Propagation
  • Very similar to AdaGrad, but with an added decay factor β.
\[ \Delta w_i(t) = -\frac{\eta}{\sqrt{G_i(t)} + \epsilon} \frac{\partial L}{\partial w_i}(t) \]

where:

\[ G_i(t) = \beta G_i(t-1) + (1 - \beta) (\frac{\partial L}{\partial w_i}(t))^2 \]

with the initial value Gi(0) = 0.

  • β is another hyperparameter.
  • Ranges from 0 to 1.
  • Usually set ~ 0.9.
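
The same sketch adapted to RMSprop: only the accumulator line changes, replacing AdaGrad’s running sum with an exponential moving average. The hyperparameter values are illustrative.

```python
import numpy as np

def rmsprop_update(w, grad, G, eta=0.01, beta=0.9, eps=1e-8):
    G = beta * G + (1 - beta) * grad ** 2  # moving average of squared gradients
    w -= eta / (np.sqrt(G) + eps) * grad   # delta_w_i = -eta / (sqrt(G_i) + eps) * dL/dw_i
    return w, G

w = np.array([1.0, -2.0])
G = np.zeros_like(w)  # starting point G_i(0) = 0
for t in range(500):
    grad = 2 * w      # gradient of the toy loss L(w) = ||w||^2
    w, G = rmsprop_update(w, grad, G)
print(w)  # hovers in a small band around the minimum at 0
```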

Adam (Adaptive Moment Estimation)

  • Often outperforms both AdaGrad and RMSprop in practice.
  • Adam combines the momentum idea with the adaptive per-weight learning rates of AdaGrad and RMSprop.
\[ \Delta w_i(t) = -\frac{\eta}{\sqrt{G_i(t)} + \epsilon} M_i(t) \]

where:

\[ M_i(t) = \alpha M_i(t-1) + (1 - \alpha) \frac{\partial L}{\partial w_i}(t) \]

with the initial value Mi(0) = 0, and with Gi(t) accumulated exactly as in RMSprop above.
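
A sketch following the simplified equations in this section, combining the momentum term M with the RMSprop accumulator G. Note that the published Adam algorithm additionally applies bias-correction factors to M and G, which are omitted here to match the text; all hyperparameter values are illustrative.

```python
import numpy as np

def adam_update(w, grad, M, G, eta=0.01, a=0.9, b=0.999, eps=1e-8):
    M = a * M + (1 - a) * grad         # momentum: M_i(t) = alpha*M_i(t-1) + (1-alpha)*g_i
    G = b * G + (1 - b) * grad ** 2    # RMSprop-style accumulator for G_i(t)
    w -= eta / (np.sqrt(G) + eps) * M  # delta_w_i = -eta / (sqrt(G_i) + eps) * M_i(t)
    # Full Adam would bias-correct M and G before this step.
    return w, M, G

w = np.array([1.0, -2.0])
M = np.zeros_like(w)  # starting point M_i(0) = 0
G = np.zeros_like(w)  # starting point G_i(0) = 0
for t in range(500):
    grad = 2 * w      # gradient of the toy loss L(w) = ||w||^2
    w, M, G = adam_update(w, grad, M, G)
print(w)  # settles near the minimum at 0
```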