
Sep 21, 2023 | lesswrong.com | Logan Riggs, Aidan Ewart, Scott Emmons, Charlie Steiner
This is a linkpost for *Sparse Autoencoders Find Highly Interpretable Directions in Language Models*.

We use a scalable, unsupervised method, sparse autoencoders, to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M), for both the residual stream and MLP layers. We showcase monosemantic features, feature replacement on the Indirect Object Identification (IOI) task, and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
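To make the core idea concrete, here is a minimal sketch of a sparse autoencoder over model activations, assuming PyTorch; the dictionary size, the ReLU encoder, and the L1 coefficient are illustrative choices for this kind of setup, not the paper's exact hyperparameters.

```python
# Minimal sparse-autoencoder sketch (assumed PyTorch; hyperparameters illustrative).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_dict: int):
        super().__init__()
        # Overcomplete dictionary: d_dict > d_activation, so many sparse
        # feature directions can tile the activation space.
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder = nn.Linear(d_dict, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        # Non-negative feature coefficients; the L1 penalty in the loss
        # below drives most of them to zero (sparsity).
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the activations;
    # the L1 term encourages each activation to use few features.
    mse = ((reconstruction - x) ** 2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```

In training, `x` would be a batch of residual-stream or MLP activations collected from the language model over a large text corpus; each row of the decoder weight matrix is then a candidate interpretable feature direction.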