Transformers Don't Need LayerNorm at Inference Time
July 23, 2025
Accepted to ICLR 2026. I co-authored a LessWrong post on removing LayerNorm from transformers by fine-tuning, and on what this implies for mechanistic interpretability (direct logit attribution, attribution patching, entropy neurons).
With Joachim Schaeffer, Luca Baroni, galvsk, and StefanHex. Work from MARS and SPAR. Links: arXiv, code, and models on Hugging Face.