
Senthooran Rajamanoharan
Articles
-
Nov 1, 2024 | lesswrong.com | Senthooran Rajamanoharan | Neel Nanda | Subhash Kantamneni
Subhash and Josh are co-first authors.
-
Jul 19, 2024 | lesswrong.com | Neel Nanda | Senthooran Rajamanoharan | Tom Lieberum | Vikrant Varma
New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan! We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability.
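For concreteness, the JumpReLU activation can be written as JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a learned threshold. A minimal sketch of the forward pass (parameter names and the per-feature threshold shape are illustrative assumptions, not taken from the paper's code):

```python
import torch

def jumprelu(z: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """JumpReLU: z * H(z - theta).

    Pre-activations above the threshold pass through unchanged;
    everything at or below the threshold is zeroed out. This is the
    discontinuous "jump" that distinguishes it from a plain ReLU.
    """
    return z * (z > theta).to(z.dtype)

# Illustrative usage: a batch of 4 pre-activations over 512 SAE features,
# with one learned threshold per feature (shapes are assumptions).
z = torch.randn(4, 512)
theta = torch.full((512,), 0.5)
acts = jumprelu(z, theta)
```

Because the step function has zero gradient almost everywhere, the paper trains the thresholds with straight-through estimators; the sketch above covers only the forward pass.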
-
Apr 25, 2024 | lesswrong.com | Neel Nanda | Senthooran Rajamanoharan | Arthur Conmy | Tom Lieberum
(The question in this comment is narrower and probably not interesting to most people.) The limitations section includes this paragraph: "One worry about increasing the expressivity of sparse autoencoders is that they will overfit when reconstructing activations (Olah et al., 2023, Dictionary Learning Worries), since the underlying model only uses simple MLPs and attention heads, and in particular lacks discontinuities such as step functions. Overall we do not see evidence for this."
-
Apr 19, 2024 | lesswrong.com | Neel Nanda | Arthur Conmy | Senthooran Rajamanoharan | Tom Lieberum
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders that didn't meet our bar for a full paper. Please start at the summary post for more context and a summary of each snippet. They can be read in any order. Arthur Conmy, Neel Nanda. TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features.
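As a rough illustration of that kind of decomposition: a steering vector living in the residual stream can be passed through the SAE's encoder, and the most strongly activated features inspected. A minimal sketch, assuming a plain ReLU SAE with encoder weights W_enc and bias b_enc (all names and shapes here are hypothetical; the post's actual method may differ in details such as bias handling):

```python
import torch

def decompose_steering_vector(
    v: torch.Tensor,       # steering vector, shape (d_model,)
    W_enc: torch.Tensor,   # SAE encoder weights, shape (d_model, d_sae)
    b_enc: torch.Tensor,   # SAE encoder bias, shape (d_sae,)
    top_k: int = 10,
):
    # Encode the steering vector as if it were a residual-stream
    # activation, then report the top-k most active SAE features,
    # which can then be interpreted individually.
    acts = torch.relu(v @ W_enc + b_enc)
    vals, idxs = acts.topk(top_k)
    return list(zip(idxs.tolist(), vals.tolist()))
```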
-
Apr 19, 2024 | lesswrong.com | Neel Nanda | Arthur Conmy | Tom Lieberum | Senthooran Rajamanoharan
This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team's excellent monthly updates! Our goal was to write up a series of snippets covering a range of things we thought would be interesting to the broader community but that didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.