
Adaptive Moment Estimation (ADAM) 🌈

Kwanwoo · Sat Oct 09 2021

🌈 Getting to Know Gradient Descent

Before we dive in, let's touch on the basics of Gradient Descent. This is a method we often use to minimize our cost function, or error, in machine learning algorithms. Sounds good, right? But there's a catch - sometimes, it can be a slow process.
This is where the learning rate comes into play. We can speed up convergence by increasing the learning rate, but if it's too large, the shape of the cost surface can make the cost oscillate, or even diverge. So we often end up waiting patiently, even when learning feels sluggish.
Figure 1. Gradient Descent
Let's look at Figure 1, where the parameter b oscillates up and down before it eventually converges to the minimum. This happens because the gradient vector always points perpendicular to the contour lines.
Figure 2. What we want
The ideal situation? We want to converge slowly in the vertical direction and quickly in the horizontal direction (check out Figure 2).
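To make the update concrete, here's a minimal Python sketch of a single vanilla gradient descent step (illustrative only; the gradients `dW` and `db` are assumed to come from your own backpropagation code, and the parameter names are placeholders):

```python
import numpy as np

def gradient_descent_step(W, b, dW, db, alpha=0.01):
    """One step of plain gradient descent.

    W, b   : current parameters (NumPy arrays or scalars)
    dW, db : gradients of the cost with respect to W and b
    alpha  : learning rate
    """
    W = W - alpha * dW
    b = b - alpha * db
    return W, b
```

A larger `alpha` moves faster across the cost surface, but it's exactly this step size that can make b overshoot and oscillate as in Figure 1.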
 

🚀Adding Momentum to Gradient Descent

Gradient Descent with Momentum is like an upgrade! It's designed to slow down updates in directions whose gradient keeps flipping sign from one iteration to the next, and to speed them up in directions where the gradient stays consistent.

🎯Momentum Algorithm:

We start with $v_{dW} = 0$ and $v_{db} = 0$.

During the t-th iteration:

1. We calculate $dW$ and $db$ for the current batch.
2. Next, we compute:
   $$v_{dW} = \beta\, v_{dW} + (1-\beta)\, dW \tag{1}$$
   $$v_{db} = \beta\, v_{db} + (1-\beta)\, db \tag{2}$$
3. Finally, we update our Weight and Bias as follows:
   $$W = W - \alpha\, v_{dW}, \qquad b = b - \alpha\, v_{db}$$
   (where $\alpha$ is the learning rate)
       
The core of the Momentum algorithm is Equations (1) and (2). Since the two have the same form, let's expand Equation (1) a little further and think about what it does.
Equation (1) is computed recursively, so writing it out step by step from iteration 1 gives the following.
Iteration 1: $v_{dW_1} = (1-\beta)\, dW_1$
Iteration 2: $v_{dW_2} = \beta(1-\beta)\, dW_1 + (1-\beta)\, dW_2$
Iteration 3: $v_{dW_3} = \beta^2(1-\beta)\, dW_1 + \beta(1-\beta)\, dW_2 + (1-\beta)\, dW_3$
In general: $v_{dW_t} = (1-\beta)\sum_{i=1}^{t}\beta^{\,t-i}\, dW_i$, an exponentially weighted average of the past gradients (normalizing by $1-\beta^{\,t}$ gives the bias-corrected version that reappears later in ADAM).
As we iterate, the gradient components along the b-axis, which keep flipping sign, cancel each other out and shrink toward zero, while the components along the W-axis, which keep pointing the same way, accumulate and speed up. This is how the momentum algorithm adjusts the learning speed of each parameter appropriately.
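Here's a minimal NumPy sketch of one momentum update, following Equations (1) and (2) above (just an illustrative sketch, not a reference implementation; `dW` and `db` are assumed to come from backprop):

```python
import numpy as np

def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One iteration of gradient descent with momentum.

    v_dW and v_db hold the exponentially weighted averages of past
    gradients; initialize them to zeros with the same shapes as W and b.
    """
    v_dW = beta * v_dW + (1 - beta) * dW   # Equation (1)
    v_db = beta * v_db + (1 - beta) * db   # Equation (2)
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```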
 

🌪️RMSProp: Harnessing the Power of Propagation

RMSProp, although never published in an official academic paper, has proven its worth in the machine learning field. It's quite similar to Gradient Descent with Momentum, but instead of averaging the gradients themselves it uses the magnitude of recent gradients to adjust the learning speed of each parameter.

🎯RMSProp Algorithm:

We start with $s_{dW} = 0$ and $s_{db} = 0$.

During the t-th iteration:

1. We calculate $dW$ and $db$ for the current batch.
2. Next, we compute (the squares are element-wise):
   $$s_{dW} = \beta\, s_{dW} + (1-\beta)\, dW^2$$
   $$s_{db} = \beta\, s_{db} + (1-\beta)\, db^2$$
3. Finally, we update our Weight and Bias as follows:
   $$W = W - \alpha\, \frac{dW}{\sqrt{s_{dW}} + \epsilon}, \qquad b = b - \alpha\, \frac{db}{\sqrt{s_{db}} + \epsilon}$$
   (where $\alpha$ is the learning rate and $\epsilon$ is a small constant that prevents division by zero)
By dividing each gradient by $\sqrt{s}$, RMSProp takes smaller steps for parameters whose gradients are large and oscillating, and relatively larger steps for parameters whose gradients are small and consistent, which allows for efficient learning.
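A minimal NumPy sketch of one RMSProp update, under the same assumptions as the momentum sketch (illustrative only; `eps` keeps the denominator away from zero):

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.01, beta=0.9, eps=1e-8):
    """One iteration of RMSProp.

    s_dW and s_db hold the exponentially weighted averages of the
    squared gradients; initialize them to zeros shaped like W and b.
    """
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # element-wise square
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)  # scale step by gradient magnitude
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```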
 

🎯ADAM: The Best of Both Worlds

ADAM (Adaptive Moment Estimation) is like a supercharged method that combines Gradient Descent with Momentum and RMSProp in a single update.

The ADAM algorithm:

We start with $v_{dW} = 0$, $s_{dW} = 0$, $v_{db} = 0$, and $s_{db} = 0$.

During the t-th iteration:

1. We calculate $dW$ and $db$ for the current batch.
2. Next, we compute the momentum terms, the RMSProp terms, and their bias-corrected versions:
   $$v_{dW} = \beta_1\, v_{dW} + (1-\beta_1)\, dW, \qquad v_{db} = \beta_1\, v_{db} + (1-\beta_1)\, db$$
   $$s_{dW} = \beta_2\, s_{dW} + (1-\beta_2)\, dW^2, \qquad s_{db} = \beta_2\, s_{db} + (1-\beta_2)\, db^2$$
   $$\hat{v}_{dW} = \frac{v_{dW}}{1-\beta_1^{\,t}}, \quad \hat{v}_{db} = \frac{v_{db}}{1-\beta_1^{\,t}}, \quad \hat{s}_{dW} = \frac{s_{dW}}{1-\beta_2^{\,t}}, \quad \hat{s}_{db} = \frac{s_{db}}{1-\beta_2^{\,t}}$$
3. Finally, we update our Weight and Bias as follows:
   $$W = W - \alpha\, \frac{\hat{v}_{dW}}{\sqrt{\hat{s}_{dW}} + \epsilon}, \qquad b = b - \alpha\, \frac{\hat{v}_{db}}{\sqrt{\hat{s}_{db}} + \epsilon}$$
   (where $\alpha$ is the learning rate)
The creators of the ADAM algorithm suggest using $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$ for best results. With ADAM, you're combining the advantages of two powerful algorithms for efficient optimization!
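Putting the two together, here's a minimal NumPy sketch of one ADAM update with the suggested defaults (a sketch for illustration, not the paper's reference code; `t` is the 1-based iteration counter used for bias correction):

```python
import numpy as np

def adam_step(W, b, dW, db, v_dW, v_db, s_dW, s_db, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One iteration of ADAM; t starts at 1."""
    # Momentum-style first moments
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    v_db = beta1 * v_db + (1 - beta1) * db
    # RMSProp-style second moments
    s_dW = beta2 * s_dW + (1 - beta2) * dW ** 2
    s_db = beta2 * s_db + (1 - beta2) * db ** 2
    # Bias correction (matters most during the first few iterations)
    v_dW_hat, v_db_hat = v_dW / (1 - beta1 ** t), v_db / (1 - beta1 ** t)
    s_dW_hat, s_db_hat = s_dW / (1 - beta2 ** t), s_db / (1 - beta2 ** t)
    # Parameter update
    W = W - alpha * v_dW_hat / (np.sqrt(s_dW_hat) + eps)
    b = b - alpha * v_db_hat / (np.sqrt(s_db_hat) + eps)
    return W, b, v_dW, v_db, s_dW, s_db
```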
 
Reference: Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. https://arxiv.org/pdf/1412.6980.pdf