
Robert Krzyzanowski
Articles
-
Jul 18, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive Summary: We train SAEs on base/chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
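The transfer claim above can be pictured as: train a sparse autoencoder on one model's activations, then measure its reconstruction error on the other model's activations for the same prompts. The following is a minimal illustrative sketch, not the authors' code; the SparseAutoencoder class, tensor shapes, and the names base_sae and chat_acts are assumptions for illustration.

```python
# Illustrative sketch only (not the authors' code): a toy SAE and a helper that
# measures how well an SAE trained on base-model activations reconstructs
# chat-model activations. Class, shapes, and names are assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # ReLU encoding yields sparse, non-negative feature activations.
        features = torch.relu(self.encoder(acts))
        return self.decoder(features)


def reconstruction_mse(sae: SparseAutoencoder, acts: torch.Tensor) -> float:
    """Mean squared reconstruction error of `sae` on a batch of activations."""
    with torch.no_grad():
        return torch.mean((sae(acts) - acts) ** 2).item()


# Hypothetical usage: `base_sae` was trained on base-model activations and
# `chat_acts` holds activations cached from the chat model on the same prompts.
# transfer_error = reconstruction_mse(base_sae, chat_acts)
```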
-
Jun 21, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary: In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research.
-
Jan 16, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This post is the result of a 2-week research sprint project during the training phase of Neel Nanda's MATS stream. Executive Summary: We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
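As a rough illustration of the setup described above, the standard SAE objective (reconstruction error plus an L1 sparsity penalty) applied to cached attention outputs might look like the sketch below. This is a hedged sketch rather than Anthropic's or the authors' training code; the dimensions, expansion factor, and l1_coeff are assumptions chosen for illustration.

```python
# Illustrative sketch (assumptions throughout, not the original training code):
# the usual SAE objective on attention-layer outputs -- reconstruction error
# plus an L1 penalty that encourages sparse feature activations.
import torch
import torch.nn.functional as F

d_model, expansion, l1_coeff = 512, 8, 1e-3   # illustrative hyperparameters
d_hidden = d_model * expansion

W_enc = torch.nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
W_dec = torch.nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
b_dec = torch.nn.Parameter(torch.zeros(d_model))

def sae_loss(attn_out: torch.Tensor) -> torch.Tensor:
    """attn_out: [batch, d_model] attention outputs cached from the model."""
    feats = F.relu(attn_out @ W_enc + b_enc)   # sparse feature activations
    recon = feats @ W_dec + b_dec              # reconstructed activations
    return F.mse_loss(recon, attn_out) + l1_coeff * feats.abs().sum(-1).mean()

# Hypothetical usage: optimise the four parameters above with Adam over
# batches of attention outputs cached from the studied layer.
```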