
Nicholas Goldowsky-Dill
Articles
-
Sep 12, 2024 |
lesswrong.com | Zach Stein-Perlman | Nicholas Goldowsky-Dill
-
May 20, 2024 |
lesswrong.com | Marius Hobbhahn | Lucius Bushnaq | Dan Braun | Nicholas Goldowsky-Dill
This is a linkpost for our two recent papers, produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu: An exploration of using degeneracy in the loss landscape for interpretability (https://arxiv.org/abs/2405.10927) and An empirical test of an interpretability technique based on the loss landscape (https://arxiv.org/abs/2405.10928). Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
-
May 17, 2024 |
lesswrong.com | Dan Braun | Jordan Taylor | Lee Sharkey | Nicholas Goldowsky-Dill
A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted.