Nicholas Goldowsky-Dill

Featured in:

Articles

OpenAI o1 - LessWrong

Sep 12, 2024 | lesswrong.com | Zach Stein-Perlman | Nicholas Goldowsky-Dill
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks — LessWrong

May 20, 2024 | lesswrong.com | Marius Hobbhahn | Lucius Bushnaq | Dan Braun | Nicholas Goldowsky-Dill

This is a linkpost for our two recent papers, produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu: an exploration of using degeneracy in the loss landscape for interpretability (https://arxiv.org/abs/2405.10927) and an empirical test of an interpretability technique based on the loss landscape (https://arxiv.org/abs/2405.10928). Not to be confused with Apollo's recent Sparse Dictionary Learning paper.
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning — LessWrong

May 17, 2024 | lesswrong.com | Dan Braun | Jordan Taylor | Lee Sharkey | Nicholas Goldowsky-Dill

A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted.
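
As a rough illustration of the KL objective named in the TL;DR, the sketch below computes the mean KL divergence between two sets of output logits: one from the original model and one from the same model with SAE activations inserted. This is not the paper's code. The function names, the NumPy implementation, and the example logits are assumptions made for illustration, and the sketch covers only the loss term, not the SAE or its training loop.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_divergence(original_logits, sae_logits):
    """Mean KL(original || SAE-inserted) over a batch of output distributions.

    Both arguments are hypothetical arrays of shape (batch, vocab):
    original_logits from the unmodified model, sae_logits from the model
    run with SAE activations inserted.
    """
    log_p = log_softmax(np.asarray(original_logits, dtype=float))
    log_q = log_softmax(np.asarray(sae_logits, dtype=float))
    per_example = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_example.mean())


# Made-up logits for a two-token vocabulary.
original = np.array([[0.0, 0.0]])
with_sae = np.array([[np.log(3.0), 0.0]])
loss = kl_divergence(original, with_sae)
```

The divergence is zero when the two output distributions match and grows as inserting the SAE changes the model's predictions, which is the sense in which minimizing it favors functionally important features.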
