Max Li 李赵硕

@mli0603

I've been debugging RoPE recently and kept getting tripped up by details that most explanations gloss over. So I wrote a deep dive: "Understanding RoPE: From Rotary Embeddings to Context Extension" https://mli0603.notion.site/Understanding-RoPE-From-Rotary-Embeddings-to-Context-Extension-316a341372738155a914f861a26c29d7

The blog covers:
• Full RoPE derivation from rotation matrices
• A clean proof of why RoPE's attention decays with distance (and when it breaks)
• The π boundary (RoPE's Nyquist limit)
• NTK-aware scaling derivation
• Dynamic NTK
• YaRN's frequency ramp + attention scaling
• Reference PyTorch code

Hope it helps! Feedback welcome!

mli0603.notion.site

Understanding RoPE: From Rotary Embeddings to Context Extension | Notion

TL;DR: RoPE encodes position via 2D rotations at geometrically spaced frequencies. The RoPE base sets a hard lower bound on effective context length — beyond it, the model literally prefers random tokens over similar ones. NTK-aware scaling extends context by changing the base (concentrating scaling on low-frequency dimensions); YaRN refines this with explicit frequency partitioning and attention scaling. This post derives everything from scratch, including a clean proof of why RoPE's attention decays with distance.
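The two mechanisms in the TL;DR can be sketched in a few lines of plain Python. This is an illustrative sketch, not the post's reference code: `rope_freqs`, `rotate_pair`, and `ntk_scaled_base` are hypothetical helper names, and the NTK-aware formula shown (new base = base · s^(d/(d−2))) is the commonly used variant, which stretches the lowest frequency by the full scale factor s while leaving the highest frequency untouched.

```python
import math

def rope_freqs(dim, base=10000.0):
    # Geometrically spaced frequencies: theta_i = base^(-2i/dim)
    # for each of the dim/2 rotation pairs.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def rotate_pair(x, y, pos, theta):
    # RoPE rotates each (x, y) coordinate pair by angle pos * theta,
    # so relative position enters attention through angle differences.
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def ntk_scaled_base(base, scale, dim):
    # NTK-aware scaling: raise the base so the lowest frequency is
    # slowed by `scale` while the highest stays (nearly) unchanged,
    # concentrating the interpolation on low-frequency dimensions.
    return base * scale ** (dim / (dim - 2))
```

For example, with dim = 8 and scale = 4, the slowest frequency under the scaled base is one quarter of its original value, while the fastest (theta_0 = 1) is untouched.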

9:40 AM · Mar 1, 2026