Transformers Don't Need LayerNorm at Inference Time

Accepted to ICLR 2026. I co-authored a LessWrong post on removing LayerNorm from transformers by fine-tuning, and the implications for mechanistic interpretability (direct logit attribution, attribution patching, entropy neurons).

Read on LessWrong

With Joachim Schaeffer, Luca Baroni, galvsk, and StefanHex. Work from MARS and SPAR. arXiv, code, and models on HuggingFace.
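
For intuition, here is a minimal sketch of what "removing LayerNorm" means mechanically, using a stock Hugging Face GPT-2. The module layout and names come from the `transformers` library, not from the paper's code, and simply swapping the layers out like this degrades the model; the point of the work is that a short fine-tune recovers performance afterwards.

```python
# Sketch: replace every LayerNorm in GPT-2 with an identity map.
# Assumes the Hugging Face `transformers` library; this is NOT the paper's
# fine-tuning procedure, just an illustration of the architectural change.
import torch
from torch import nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def replace_layernorm_with_identity(module: nn.Module) -> None:
    """Recursively swap every nn.LayerNorm submodule for nn.Identity."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, nn.Identity())
        else:
            replace_layernorm_with_identity(child)

replace_layernorm_with_identity(model)

# Without fine-tuning, the LayerNorm-free model's outputs are badly degraded;
# the paper fine-tunes after (gradually) removing LayerNorm to restore quality.
inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)
```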