Avi Chawla

@_avichawla

I have been fine-tuning LLMs for over 2 years now!

Here are the top 5 LLM fine-tuning techniques, explained with visuals:

First of all, what's so different about LLM fine-tuning?

Traditional full fine-tuning is impractical for LLMs: they have billions of parameters and weigh hundreds of GB. Since that kind of compute isn't accessible to everyone, parameter-efficient fine-tuning (PEFT) came into existence.

Before we go into the details of each technique, here's some background that will help you understand them better: LLM weights are matrices of numbers adjusted during fine-tuning. Most PEFT techniques involve finding a low-rank adaptation of these matrices, i.e. a smaller matrix that can still represent the information stored in the original.

With that basic understanding of low-rank matrices, we're in a good position to understand the different fine-tuning techniques. (Refer to the image below for a visual explanation of each technique; minimal code sketches for each follow below.)

1) LoRA
- Add two low-rank trainable matrices, A and B, alongside the frozen weight matrix W.
- Instead of fine-tuning W, learn the updates through these low-rank matrices.
- Even for the largest LLMs, the LoRA matrices take up only a few MB of memory.

2) LoRA-FA
- While LoRA significantly decreases the number of trainable parameters, it still requires substantial activation memory to update the low-rank weights.
- LoRA-FA (FA stands for Frozen-A) freezes matrix A and only updates matrix B.

3) VeRA
- In LoRA, the low-rank matrices A and B are unique to each layer.
- In VeRA, A and B are frozen, random, and shared across all layers.
- Instead, it learns layer-specific scaling vectors b and d.

4) Delta-LoRA
- It tunes the matrix W as well, but not in the traditional way.
- The difference (or delta) between the product of matrices A and B in two consecutive training steps is added to W.

5) LoRA+
- In LoRA, both matrices A and B are updated with the same learning rate.
- The authors of LoRA+ found that setting a higher learning rate for matrix B results in better convergence.

____

Find me → @_avichawla

Every day, I share tutorials and insights on DS, ML, LLMs, and RAGs.
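
To make the LoRA idea concrete, here is a minimal PyTorch sketch. The framework choice, the `rank`/`alpha` values, and the initialization are illustrative assumptions, not something specified in the post; the point is only that W stays frozen while the small matrices A and B are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Pretrained weight: frozen during fine-tuning (random here for the demo).
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        # Low-rank adapters: A projects down to `rank`, B projects back up.
        # B starts at zero so the adapted layer initially equals the original.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # Frozen projection W x plus the trainable low-rank update (B A) x.
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(768, 768)
out = layer(torch.randn(2, 768))   # only lora_A and lora_B receive gradients

# LoRA-FA is the same layer with matrix A frozen as well:
# layer.lora_A.requires_grad = False
```

With rank 8, the adapters add about 2 × 8 × 768 ≈ 12K trainable parameters for a 768×768 layer, versus ~590K if the full matrix were tuned, which is where the memory savings mentioned above come from.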
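
A VeRA-style sketch under the same illustrative assumptions: A and B are frozen random matrices shared by every layer, and only the per-layer scaling vectors d and b are trained.

```python
import torch
import torch.nn as nn

# Frozen random projections shared by all layers (sizes are illustrative).
RANK, DIM = 8, 768
shared_A = torch.randn(RANK, DIM)   # never trained
shared_B = torch.randn(DIM, RANK)   # never trained

class VeRALinear(nn.Module):
    """Only the layer-specific scaling vectors d and b are trainable."""
    def __init__(self, in_features=DIM, out_features=DIM):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features),
                                   requires_grad=False)
        self.register_buffer("A", shared_A)   # frozen, shared across layers
        self.register_buffer("B", shared_B)   # frozen, shared across layers
        self.d = nn.Parameter(torch.ones(RANK))           # layer-specific
        self.b = nn.Parameter(torch.zeros(out_features))  # layer-specific

    def forward(self, x):
        # b * (B (d * (A x))): gradients flow only into d and b.
        h = (x @ self.A.T) * self.d
        return x @ self.weight.T + (h @ self.B.T) * self.b

layer = VeRALinear()
out = layer(torch.randn(2, DIM))
```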
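
Delta-LoRA and LoRA+ change the training loop rather than the layer. A minimal sketch of both, using plain PyTorch tensors and toy data; the dimensions, learning rates, and the 16x ratio for B are illustrative, not values from the post.

```python
import torch
import torch.nn as nn

# Minimal LoRA-style parameters (sizes and hyperparameters are illustrative).
dim, rank, scaling = 768, 8, 2.0
W = nn.Parameter(torch.randn(dim, dim), requires_grad=False)  # frozen pretrained weight
A = nn.Parameter(torch.randn(rank, dim) * 0.01)
B = nn.Parameter(torch.zeros(dim, rank))

# LoRA+: matrix B gets a larger learning rate than A via two parameter groups.
optimizer = torch.optim.AdamW([
    {"params": [A], "lr": 1e-4},
    {"params": [B], "lr": 1e-4 * 16},   # higher LR for B, per the LoRA+ finding
])

# Delta-LoRA: fold the step-to-step change of the product B @ A back into W.
prev_product = (B @ A).detach().clone()

x, target = torch.randn(4, dim), torch.randn(4, dim)   # dummy batch
for step in range(3):                                   # toy training loop
    out = x @ W.T + scaling * (x @ A.T @ B.T)
    loss = ((out - target) ** 2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    with torch.no_grad():
        new_product = B @ A
        W += scaling * (new_product - prev_product)     # delta added to frozen W
        prev_product = new_product.clone()
```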

[Image: visual explanations of the five fine-tuning techniques]
8:18 AM · Apr 1, 2026