Transformers Don't Need LayerNorm at Inference Time
Accepted to ICLR 2026. I co-authored a LessWrong post on removing LayerNorm from transformers by fine-tuning, and on the implications for mechanistic interpretability (direct logit attribution, attribution patching, entropy neurons).