Research paper
Medium
@mli0603
Importance score: 4 • Posted: March 01, 2026 at 09:40
I've been debugging RoPE recently and kept getting tripped up by details that most explanations gloss over, so I wrote a deep dive: "Understanding RoPE: From Rotary Embeddings to Context Extension" https://mli0603.notion.site/Understanding-RoPE-From-Rotary-Embeddings-to-Context-Extension-316a341372738155a914f861a26c29d7

The blog covers:
• Full RoPE derivation from rotation matrices
• A clean proof of why RoPE's attention decays with distance (and when it breaks)
• The π boundary (RoPE's Nyquist limit)
• NTK-aware scaling derivation
• Dynamic NTK
• YaRN's frequency ramp + attention scaling
• Reference PyTorch code

Hope it helps! Feedback welcome!
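For readers who want the core idea before reading the full derivation: RoPE rotates each consecutive pair of query/key features by a position-dependent angle, so the dot product of a rotated query and key depends only on their relative position. Below is a minimal dependency-free sketch (not the blog's reference PyTorch code) using the standard frequencies theta_i = base^(-2i/d); the function names and the 4-dim toy vectors are illustrative choices, not from the post.

```python
import math

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE: rotate each consecutive feature pair (x[2i], x[2i+1])
    by angle pos * theta_i, with theta_i = base**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)          # exponent is -2*(pair index)/d
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = x[i], x[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Toy query/key vectors (illustrative values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Relative-position property: <R(m)q, R(n)k> depends only on m - n.
# Positions (7, 3) and (104, 100) share the offset 4, so the scores match.
a = dot(rope_rotate(q, 7), rope_rotate(k, 3))
b = dot(rope_rotate(q, 104), rope_rotate(k, 100))
assert abs(a - b) < 1e-9
```

The assertion holds because each 2-D rotation satisfies R(mθ)ᵀR(nθ) = R((n−m)θ), which is exactly the relative-position property the blog's derivation builds on.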
Grok reasoning
Deep technical dive into RoPE embeddings with derivations and code; relevant to LLM architecture and fine-tuning.
Likes: 423
Reposts: 48
Views: 42,591
Tweet ID: 2028042699652419984
Prompt source: ai-news
Fetched at: March 02, 2026 at 07:00