Arthur Conmy

Featured in: Favicon

flyertalk.com

Articles

Evolutionary prompt optimization for SAE feature visualization — LessWrong

Nov 14, 2024 | lesswrong.com | Daniel Tan |Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

TLDR:Fluent dreaming for language models is an algorithm based on the GCG method that can reliably find plain-text readable prompts for LLMs that maximize certain logits or residual stream directions by using gradients and genetic algorithms. Authors showed its use for visualizing MLP neurons. We show this method can also help interpret SAE features.
SAEs are highly dataset dependent: a case study on the refusal direction — LessWrong

Nov 7, 2024 | lesswrong.com | Connor Kissane |Neel Nanda |Arthur Conmy

This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel. Executive SummaryProblem: Qwen1.5 0.5B Chat SAEs trained on the pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse grained and underperforms the refusal direction at steering tasks. This is disappointing. The point of an SAE is to find meaningful concepts.
Open Source Replication of Anthropic’s Crosscoder paper for model-diffing — LessWrong

Oct 27, 2024 | lesswrong.com | Connor Kissane |Arthur Conmy |Neel Nanda

IntroAnthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle layer residual stream of the Gemma-2 2B base and IT models, along with code, / tips, and a replication of the core results in Anthropic’s paper. While Anthropic highlights several potential applications of crosscoders, in this post we focus solely on “model-diffing”.
SAE features for refusal and sycophancy steering vectors — LessWrong

Oct 12, 2024 | lesswrong.com | Dmitrii Kharlapenko |Arthur Conmy |Neel Nanda

Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors. We apply the gradient pursuit algorithm suggested by to decompose steering vectors, and find that they contain many interpretable and promising-looking features. This builds off our prior work, which applies ITO and derivatives to steering vectors with less success.
Base LLMs refuse too — LessWrong

Sep 29, 2024 | lesswrong.com | Connor Kissane |Arthur Conmy |Neel Nanda

Executive SummaryRefusing harmful requests is not a novel behavior learned in chat fine-tuning, as pre-trained base models will also refuse requests (48% of all harmful requests, 3% of harmless) just at a lower rate than chat models (90% harmful, 3% harmless)Further, for both Qwen 1.5 0.5B and Gemma 2 9B, chat fine-tuning reinforces the existing mechanisms. In both the chat and base models it is mediated by the refusal direction described in Arditi et al.
Extracting SAE task features for in-context learning — LessWrong

Aug 12, 2024 | lesswrong.com | Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features — you can't just pass an arbitrary vector through the encoder without going off distribution. We explored the algorithm of gradient pursuit suggested in , but it didn’t work for us without modifications. Our approach is to apply the SAE encoder to the task vector, and then apply a gradient-based cleanup.
Extracting SAE task features for ICL — LessWrong

Aug 12, 2024 | lesswrong.com | Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features — you can't just pass an arbitrary vector through the encoder without going off distribution. We explored the algorithm of gradient pursuit suggested in , but it didn’t work for us without modifications. Our approach is to apply the SAE encoder to the task vector, and then apply a gradient-based cleanup.
Self-explaining SAE features — LessWrong

Aug 5, 2024 | lesswrong.com | Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

We apply the method of SelfIE/Patchscopes to explain SAE features – we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation. The natural alternative is auto-interp, using a larger LLM to spot patterns in max activating examples.
SAEs (usually) Transfer Between Base and Chat Models — LessWrong

Jul 18, 2024 | lesswrong.com | Connor Kissane |Arthur Conmy |Neel Nanda |Robert Krzyzanowski

This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive SummaryWe train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
Attention Output SAEs Improve Circuit Analysis — LessWrong

Jun 21, 2024 | lesswrong.com | Connor Kissane |Arthur Conmy |Neel Nanda |Robert Krzyzanowski

This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive SummaryIn a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research.