
Callum McDougall
Articles
-
Apr 1, 2024 | lesswrong.com | Callum McDougall | Joseph Bloom
Epistemic status - self-evident. In this post, we interpret a small sample of Sparse Autoencoder features which reveal meaningful computational structure in the model that is clearly highly researcher-independent and of significant relevance to AI alignment. Recent excitement about Sparse Autoencoders (SAEs) has been marred by the following question: do SAE features reflect properties of the model, or do they just capture correlational structure in the underlying data distribution?
-
Mar 31, 2024 | lesswrong.com | Callum McDougall | Joseph Bloom
This is a post to officially announce the sae-vis library, which was designed to create feature dashboards like those from Anthropic's research. The library supports two types of visualisation: feature-centric and prompt-centric. The feature-centric vis is the standard one from Anthropic's post; it looks like the image below, and there's an option to navigate through different features via a dropdown in the top left.
-
Nov 29, 2023 | lesswrong.com | Callum McDougall | James Dao
This is a linkpost for some exercises in sparse autoencoders, which I've recently finished working on as part of the upcoming ARENA 3.0 iteration. Having spoken to Neel Nanda and others in interpretability-related MATS streams, it seemed useful to make these exercises accessible out of the context of the rest of the ARENA curriculum. Links to Colabs (updated): Exercises, Solutions.
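The exercises above center on sparse autoencoders. As background, a minimal sketch of the standard SAE architecture they build on: an overcomplete ReLU encoder over model activations, a linear decoder, and a loss combining reconstruction error with an L1 sparsity penalty on feature activations. This is an illustrative NumPy sketch, not the ARENA exercise code; all function and parameter names here are my own.

```python
import numpy as np

def init_sae(d_model: int, d_hidden: int, seed: int = 0) -> dict:
    """Initialise SAE parameters; d_hidden > d_model gives an overcomplete basis."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0, 0.01, (d_model, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0, 0.01, (d_hidden, d_model)),
        "b_dec": np.zeros(d_model),
    }

def sae_forward(params: dict, x: np.ndarray):
    """Encode activations x into sparse features, then reconstruct."""
    # ReLU encoder: non-negative feature activations
    acts = np.maximum(0.0, (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"])
    # Linear decoder: reconstruction as a sparse combination of feature directions
    recon = acts @ params["W_dec"] + params["b_dec"]
    return recon, acts

def sae_loss(x: np.ndarray, recon: np.ndarray, acts: np.ndarray,
             l1_coeff: float = 1e-3) -> float:
    """Reconstruction MSE plus an L1 penalty encouraging sparse activations."""
    mse = ((recon - x) ** 2).mean()
    l1 = np.abs(acts).sum(axis=-1).mean()
    return float(mse + l1_coeff * l1)

# Example: run a batch of residual-stream-like activations through the SAE
params = init_sae(d_model=64, d_hidden=256)
x = np.random.default_rng(1).normal(size=(8, 64))
recon, acts = sae_forward(params, x)
print(recon.shape, acts.shape)  # (8, 64) (8, 256)
```

The L1 coefficient trades off reconstruction fidelity against sparsity; the exercises explore this trade-off in much more depth, on real transformer activations.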