
Callum McDougall
Articles
-
Apr 1, 2024 | lesswrong.com | Callum McDougall | Joseph Bloom
Epistemic status - self-evident. In this post, we interpret a small sample of Sparse Autoencoder features which reveal meaningful computational structure in the model that is clearly highly researcher-independent and of significant relevance to AI alignment. Recent excitement about Sparse Autoencoders (SAEs) has been marred by the following question: do SAE features reflect properties of the model, or do they just capture correlational structure in the underlying data distribution?
-
Mar 31, 2024 | lesswrong.com | Callum McDougall | Joseph Bloom
This is a post to officially announce the sae-vis library, which was designed to create feature dashboards like those from Anthropic's research. The library supports two types of visualisation: feature-centric and prompt-centric. The feature-centric vis is the standard one from Anthropic's post; it looks like the image below, and there's an option to navigate through different features via a dropdown in the top left.
-
Nov 29, 2023 | lesswrong.com | Callum McDougall | James Dao
This is a linkpost for some exercises in sparse autoencoders, which I've recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible out of the context of the rest of the ARENA curriculum. Links to Colabs (updated): Exercises, Solutions.
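The exercises above center on sparse autoencoders. As background, a minimal sketch of the standard SAE architecture they build on: an overcomplete ReLU encoder over model activations, a linear decoder, and a loss combining reconstruction error with an L1 sparsity penalty on feature activations. This is an illustrative NumPy sketch, not the ARENA exercise code; all function and parameter names here are my own.

```python
import numpy as np

def init_sae(d_model: int, d_hidden: int, seed: int = 0) -> dict:
    """Initialise SAE parameters; d_hidden > d_model gives an overcomplete basis."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0, 0.01, (d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0, 0.01, (d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params: dict, x: np.ndarray):
    """Encode activations x into sparse features, then reconstruct."""
    # ReLU encoder: non-negative feature activations
    acts = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    # Linear decoder: reconstruction as a sparse combination of feature directions
    recon = acts @ params["W_dec"] + params["b_dec"]
    return recon, acts

def sae_loss(x: np.ndarray, recon: np.ndarray, acts: np.ndarray,
             l1_coeff: float = 1e-3) -> float:
    """Reconstruction MSE plus an L1 penalty encouraging sparse activations."""
    mse = ((recon - x) ** 2).mean()
    l1 = np.abs(acts).sum(axis=-1).mean()
    return float(mse + l1_coeff * l1)

# Example: run a batch of residual-stream-like activations through the SAE
params = init_sae(d_model=64, d_hidden=256)
x = np.random.default_rng(1).normal(size=(8, 64))
recon, acts = sae_forward(params, x)
print(recon.shape, acts.shape)  # (8, 64) (8, 256)
```

The L1 coefficient trades off reconstruction fidelity against sparsity; the exercises explore this trade-off in much more depth, on real transformer activations.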