
Connor Kissane
Articles
-
Nov 7, 2024 | lesswrong.com | Connor Kissane | Neel Nanda | Arthur Conmy
This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel. Executive Summary. Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse-grained and underperforms the refusal direction on steering tasks. This is disappointing, since the point of an SAE is to find meaningful concepts.
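The negative result concerns whether any small set of SAE latents captures the refusal direction. As an illustrative diagnostic only (not necessarily the procedure used in the post), one can rank latents by the cosine similarity between their decoder vectors and the refusal direction; `W_dec` and `refusal_dir` below are placeholders for the SAE decoder matrix and the Arditi et al. direction.

```python
import torch

def top_refusal_aligned_latents(refusal_dir: torch.Tensor, W_dec: torch.Tensor, k: int = 10):
    """Rank SAE latents by cosine similarity between their decoder vectors and a direction.

    refusal_dir: (d_model,) refusal direction from Arditi et al.
    W_dec: (d_sae, d_model) SAE decoder matrix (one row per latent).
    """
    dec_unit = W_dec / W_dec.norm(dim=-1, keepdim=True)   # unit decoder vectors
    cos = dec_unit @ (refusal_dir / refusal_dir.norm())   # (d_sae,) cosine similarities
    vals, idx = cos.topk(k)
    return idx, vals                                       # most refusal-aligned latents
```

If no latent scores highly, or the best-scoring latent fires on many unrelated contexts, the direction is not being captured by a single interpretable unit.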
-
Oct 27, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda
Intro: Anthropic recently released an exciting mini-paper on crosscoders (Lindsey et al.). In this post, we open source a model-diffing crosscoder trained on the middle-layer residual stream of the Gemma-2 2B base and IT models, along with code, tips, and a replication of the core results in Anthropic’s paper. While Anthropic highlights several potential applications of crosscoders, in this post we focus solely on “model-diffing”.
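For readers unfamiliar with the architecture: a model-diffing crosscoder learns one shared set of sparse latents that must reconstruct the same token position's residual-stream activation in both models, with a separate decoder per model. Below is a minimal PyTorch sketch under our own naming (`TwoModelCrosscoder`, `enc_base`, etc.); the sparsity term roughly follows the decoder-norm-weighted L1 of Lindsey et al., and the coefficient is illustrative, not the value used in the released crosscoder.

```python
import torch
import torch.nn as nn

class TwoModelCrosscoder(nn.Module):
    """Sketch of a model-diffing crosscoder: shared latents, per-model decoders."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.enc_base = nn.Linear(d_model, d_sae, bias=False)
        self.enc_chat = nn.Linear(d_model, d_sae, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.dec_base = nn.Linear(d_sae, d_model)
        self.dec_chat = nn.Linear(d_sae, d_model)

    def forward(self, a_base, a_chat):
        # One shared latent vector summarizes the same token in both models.
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        return f, self.dec_base(f), self.dec_chat(f)

def crosscoder_loss(model, a_base, a_chat, l1_coeff=3e-4):
    f, r_base, r_chat = model(a_base, a_chat)
    recon = (a_base - r_base).pow(2).sum(-1) + (a_chat - r_chat).pow(2).sum(-1)
    # Sparsity penalty weighted by the summed decoder norms of each latent.
    dec_norms = model.dec_base.weight.norm(dim=0) + model.dec_chat.weight.norm(dim=0)
    sparsity = (f * dec_norms).sum(-1)
    return (recon + l1_coeff * sparsity).mean()
```

Latents whose decoder norm is large in one model and near zero in the other are the candidates for chat-specific (or base-specific) features, which is what makes this setup useful for model-diffing.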
-
Sep 29, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda
Executive Summary: Refusing harmful requests is not a novel behavior learned in chat fine-tuning; pre-trained base models also refuse requests (48% of harmful requests, 3% of harmless), just at a lower rate than chat models (90% harmful, 3% harmless). Further, for both Qwen 1.5 0.5B and Gemma 2 9B, chat fine-tuning reinforces the existing mechanisms: in both the chat and base models, refusal is mediated by the refusal direction described in Arditi et al.
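The refusal direction referenced here is, per Arditi et al., a difference of mean residual-stream activations on harmful versus harmless prompts, and refusal can be suppressed by projecting it out of the activations. A minimal sketch, assuming the activation tensors have already been collected at a fixed layer and token position:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor, harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction, as in Arditi et al.

    harmful_acts / harmless_acts: (n_prompts, d_model) residual-stream activations.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the (unit-norm) refusal direction out of activations; applied at
    every layer, this disables refusal per Arditi et al."""
    return acts - (acts @ direction)[..., None] * direction
```

Finding that the same direction mediates refusal in both the base and chat model is what supports the claim that fine-tuning reinforces an existing mechanism rather than creating a new one.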
-
Jul 18, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive Summary: We train SAEs on base/chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
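One natural way to quantify "transfer surprisingly well" is the fraction of activation variance left unexplained when, say, the base-model SAE reconstructs chat-model activations. The sketch below assumes a hypothetical SAE object exposing `.encode` and `.decode`; it is not the exact metric or interface from the post.

```python
import torch

def fraction_variance_unexplained(sae, acts: torch.Tensor) -> float:
    """Reconstruction quality of an SAE on activations it was not trained on.

    sae: object with .encode / .decode methods (hypothetical interface);
    acts: (n_tokens, d_model), e.g. chat-model activations fed to a base-model SAE.
    Lower is better; 0 means perfect reconstruction.
    """
    recon = sae.decode(sae.encode(acts))
    resid_var = (acts - recon).pow(2).sum()
    total_var = (acts - acts.mean(dim=0)).pow(2).sum()
    return (resid_var / total_var).item()
```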
-
Jun 21, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary: In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test whether Attention SAEs were genuinely helpful for circuit analysis research.