
Sep 21, 2023 | lesswrong.com | Logan Riggs, Aidan Ewart, Scott Emmons, Charlie Steiner
This is a linkpost for *Sparse Autoencoders Find Highly Interpretable Directions in Language Models*.

We use a scalable, unsupervised method, sparse autoencoders, to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), for both the residual stream and MLP layers. We showcase monosemantic features, feature replacement on the Indirect Object Identification (IOI) task, and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
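To make the core idea concrete, here is a minimal sketch of a sparse autoencoder over model activations, assuming PyTorch; the dictionary size, the ReLU encoder, and the L1 coefficient are illustrative choices for this kind of setup, not the paper's exact hyperparameters.

```python
# Minimal sparse-autoencoder sketch (assumed PyTorch; hyperparameters illustrative).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict > d_activation, so many sparse
        # feature directions can tile the activation space.
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        # Non-negative feature coefficients; the L1 penalty in the loss
        # below drives most of them to zero (sparsity).
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the activations;
    # the L1 term encourages each activation to use few features.
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

In training, `x` would be a batch of residual-stream or MLP activations collected from the language model over a large text corpus; each row of the decoder weight matrix is then a candidate interpretable feature direction.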