
Robert Krzyzanowski
Articles
-
Jul 18, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive Summary: We train SAEs on base/chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B.
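The transfer claim above can be pictured as: train a sparse autoencoder on one model's activations, then measure its reconstruction error on the other model's activations for the same prompts. The following is a minimal illustrative sketch, not the authors' code; the SparseAutoencoder class, tensor shapes, and the names base_sae and chat_acts are assumptions for illustration.

```python
# Illustrative sketch only (not the authors' code): a toy SAE and a helper that
# measures how well an SAE trained on base-model activations reconstructs
# chat-model activations. Class, shapes, and names are assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        # ReLU encoding yields sparse, non-negative feature activations.
        features = torch.relu(self.encoder(acts))
        return self.decoder(features)


def reconstruction_mse(sae: SparseAutoencoder, acts: torch.Tensor) -> float:
    """Mean squared reconstruction error of `sae` on a batch of activations."""
    with torch.no_grad():
        return torch.mean((sae(acts) - acts) ** 2).item()


# Hypothetical usage: `base_sae` was trained on base-model activations and
# `chat_acts` holds activations cached from the chat model on the same prompts.
# transfer_error = reconstruction_mse(base_sae, chat_acts)
```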
-
Jun 21, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary: In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research.
-
Jan 16, 2024 | lesswrong.com | Connor Kissane | Arthur Conmy | Neel Nanda | Robert Krzyzanowski
This post is the result of a 2-week research sprint project during the training phase of Neel Nanda's MATS stream. Executive Summary: We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two-layer language model (with MLPs).
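As a rough illustration of the setup described above, the standard SAE objective (reconstruction error plus an L1 sparsity penalty) applied to cached attention outputs might look like the sketch below. This is a hedged sketch rather than Anthropic's or the authors' training code; the dimensions, expansion factor, and l1_coeff are assumptions chosen for illustration.

```python
# Illustrative sketch (assumptions throughout, not the original training code):
# the usual SAE objective on attention-layer outputs -- reconstruction error
# plus an L1 penalty that encourages sparse feature activations.
import torch
import torch.nn.functional as F

d_model, expansion, l1_coeff = 512, 8, 1e-3   # illustrative hyperparameters
d_hidden = d_model * expansion

W_enc = torch.nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
b_enc = torch.nn.Parameter(torch.zeros(d_hidden))
W_dec = torch.nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
b_dec = torch.nn.Parameter(torch.zeros(d_model))

def sae_loss(attn_out: torch.Tensor) -> torch.Tensor:
    """attn_out: [batch, d_model] attention outputs cached from the model."""
    feats = F.relu(attn_out @ W_enc + b_enc)   # sparse feature activations
    recon = feats @ W_dec + b_dec              # reconstructed activations
    return F.mse_loss(recon, attn_out) + l1_coeff * feats.abs().sum(-1).mean()

# Hypothetical usage: optimise the four parameters above with Adam over
# batches of attention outputs cached from the studied layer.
```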