
Connor Kissane
Articles
-
Nov 7, 2024 | lesswrong.com | Connor Kissane | Neel Nanda | Arthur Conmy
This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel. Executive Summary. Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse-grained and underperforms the refusal direction on steering tasks. This is disappointing, since the point of an SAE is to find meaningful concepts.
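The negative result concerns whether any small set of SAE latents captures the refusal direction. As an illustrative diagnostic only (not necessarily the procedure used in the post), one can rank latents by the cosine similarity between their decoder vectors and the refusal direction; `W_dec` and `refusal_dir` below are placeholders for the SAE decoder matrix and the Arditi et al. direction.

```python
import torch

def top_refusal_aligned_latents(refusal_dir: torch.Tensor, W_dec: torch.Tensor, k: int = 10):
    """Rank SAE latents by cosine similarity between their decoder vectors and a direction.

    refusal_dir: (d_model,) refusal direction from Arditi et al.
    W_dec: (d_sae, d_model) SAE decoder matrix (one row per latent).
    """
    dec_unit = W_dec / W_dec.norm(dim=-1, keepdim=True)   # unit decoder vectors
    cos = dec_unit @ (refusal_dir / refusal_dir.norm())   # (d_sae,) cosine similarities
    vals, idx = cos.topk(k)
    return idx, vals                                       # most refusal-aligned latents
```

If no latent scores highly, or the best-scoring latent fires on many unrelated contexts, the direction is not being captured by a single interpretable unit.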
-
Oct 27, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda
Intro: Anthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle-layer residual stream of the Gemma-2 2B base and IT models, along with code, tips, and a replication of the core results in Anthropic’s paper. While Anthropic highlights several potential applications of crosscoders, in this post we focus solely on “model-diffing”.
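For readers unfamiliar with the architecture: a model-diffing crosscoder learns one shared set of sparse latents that must reconstruct the same token position's residual-stream activation in both models, with a separate decoder per model. Below is a minimal PyTorch sketch under our own naming (`TwoModelCrosscoder`, `enc_base`, etc.); the sparsity term roughly follows the decoder-norm-weighted L1 of Lindsey et al., and the coefficient is illustrative, not the value used in the released crosscoder.

```python
import torch
import torch.nn as nn

class TwoModelCrosscoder(nn.Module):
    """Sketch of a model-diffing crosscoder: shared latents, per-model decoders."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc_base = nn.Linear(d_model, d_sae, bias=False)
        self.enc_chat = nn.Linear(d_model, d_sae, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.dec_base = nn.Linear(d_sae, d_model)
        self.dec_chat = nn.Linear(d_sae, d_model)

    def forward(self, a_base, a_chat):
        # One shared latent vector summarizes the same token in both models.
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        return f, self.dec_base(f), self.dec_chat(f)

def crosscoder_loss(model, a_base, a_chat, l1_coeff=3e-4):
    f, r_base, r_chat = model(a_base, a_chat)
    recon = (a_base - r_base).pow(2).sum(-1) + (a_chat - r_chat).pow(2).sum(-1)
    # Sparsity penalty weighted by the summed decoder norms of each latent.
    dec_norms = model.dec_base.weight.norm(dim=0) + model.dec_chat.weight.norm(dim=0)
    sparsity = (f * dec_norms).sum(-1)
    return (recon + l1_coeff * sparsity).mean()
```

Latents whose decoder norm is large in one model and near zero in the other are the candidates for chat-specific (or base-specific) features, which is what makes this setup useful for model-diffing.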
-
Sep 29, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda
Executive Summary: Refusing harmful requests is not a novel behavior learned in chat fine-tuning; pre-trained base models also refuse requests (48% of harmful requests, 3% of harmless), just at a lower rate than chat models (90% harmful, 3% harmless). Further, for both Qwen 1.5 0.5B and Gemma 2 9B, chat fine-tuning reinforces the existing mechanisms: in both the chat and base models, refusal is mediated by the refusal direction described in Arditi et al.
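The refusal direction referenced here is, per Arditi et al., a difference of mean residual-stream activations on harmful versus harmless prompts, and refusal can be suppressed by projecting it out of the activations. A minimal sketch, assuming the activation tensors have already been collected at a fixed layer and token position:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction, as in Arditi et al.

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream activations.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the (unit-norm) refusal direction out of activations; applied at
    every layer, this disables refusal per Arditi et al."""
    return acts - (acts @ direction)[..., None] * direction
```

Finding that the same direction mediates refusal in both the base and chat model is what supports the claim that fine-tuning reinforces an existing mechanism rather than creating a new one.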
-
Jul 18, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive Summary: We train SAEs on base/chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
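One natural way to quantify "transfer surprisingly well" is the fraction of activation variance left unexplained when, say, the base-model SAE reconstructs chat-model activations. The sketch below assumes a hypothetical SAE object exposing `.encode` and `.decode`; it is not the exact metric or interface from the post.

```python
import torch

def fraction_variance_unexplained(sae, acts: torch.Tensor) -> float:
    """Reconstruction quality of an SAE on activations it was not trained on.

    sae: object with .encode / .decode methods (hypothetical interface);
    acts: (n_tokens, d_model), e.g. chat-model activations fed to a base-model SAE.
    Lower is better; 0 means perfect reconstruction.
    """
    recon = sae.decode(sae.encode(acts))
    resid_var = (acts - recon).pow(2).sum()
    total_var = (acts - acts.mean(dim=0)).pow(2).sum()
    return (resid_var / total_var).item()
```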
-
Jun 21, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary: In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test whether Attention SAEs were genuinely helpful for circuit analysis research.