Neel Nanda

Featured in: arxiv.org

Articles

  • Nov 14, 2024 | lesswrong.com | Daniel Tan | Dmitrii Kharlapenko | Neel Nanda | Arthur Conmy

    TLDR: Fluent dreaming for language models is an algorithm based on the GCG method that can reliably find plain-text, readable prompts for LLMs that maximize certain logits or residual stream directions by using gradients and genetic algorithms. The authors showed its use for visualizing MLP neurons. We show this method can also help interpret SAE features.

  • Nov 7, 2024 | lesswrong.com | Connor Kissane | Neel Nanda | Arthur Conmy

    This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel. Executive Summary. Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse-grained and underperforms the refusal direction at steering tasks. This is disappointing. The point of an SAE is to find meaningful concepts.

  • Nov 1, 2024 | lesswrong.com | Senthooran Rajamanoharan | Neel Nanda | Subhash Kantamneni

    Subhash and Josh are co-first authors.

  • Oct 27, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda

    Intro: Anthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle layer residual stream of the Gemma-2 2B base and IT models, along with code, tips, and a replication of the core results in Anthropic’s paper. While Anthropic highlights several potential applications of crosscoders, in this post we focus solely on “model-diffing”.

  • Oct 12, 2024 | lesswrong.com | Dmitrii Kharlapenko | Arthur Conmy | Neel Nanda

    Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors. We apply the gradient pursuit algorithm suggested by  to decompose steering vectors, and find that they contain many interpretable and promising-looking features. This builds off our prior work, which applies ITO and derivatives to steering vectors with less success.

