
Jordan Taylor
Articles
-
Jun 25, 2024 |
lesswrong.com | Jordan Taylor
At what point will it no longer be useful for humans to be involved in the process of alignment research? After the first slightly-superhuman AGI, well into superintelligence, or somewhere in between?
-
May 17, 2024 |
lesswrong.com | Dan Braun | Jordan Taylor | Lee Sharkey | Nicholas Goldowsky-Dill
A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the learned features are functionally important, by minimizing the KL divergence between the output distributions of the original model and the model with the SAE's activations inserted.
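The core objective from that summary can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the tiny "model", the SAE weight shapes, and all variable names here are invented for the example; the one faithful element is the loss, the KL divergence between the model's output distribution with and without the SAE reconstruction substituted in.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # KL(p || q) between output distributions, averaged over the batch
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
d_model, d_dict, d_vocab, batch = 8, 32, 10, 4

# Toy stand-in for the rest of the model: activations -> logits
W_out = rng.normal(size=(d_model, d_vocab))
acts = rng.normal(size=(batch, d_model))  # residual-stream activations

# Toy SAE: ReLU encoder into an overcomplete dictionary, linear decoder back
W_enc = rng.normal(size=(d_model, d_dict)) * 0.1
W_dec = rng.normal(size=(d_dict, d_model)) * 0.1
features = np.maximum(acts @ W_enc, 0.0)  # sparse feature activations
recon = features @ W_dec                  # SAE reconstruction of the activations

# e2e loss: compare output distributions, not reconstructed activations
p_orig = softmax(acts @ W_out)   # original model's output distribution
p_sae = softmax(recon @ W_out)   # model with SAE reconstruction inserted
e2e_loss = kl_divergence(p_orig, p_sae)
```

The point of training against this loss (rather than plain activation-reconstruction error) is that the SAE is only rewarded for preserving whatever in the activations actually affects the model's outputs, which is what makes the learned features "functionally important."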
-
Oct 4, 2023 |
lesswrong.com | Jordan Taylor
-
Aug 26, 2023 |
lesswrong.com | Daniel Murfet | Wei Dai | Roman Leventov | Jordan Taylor
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by Safeguarded AI (formerly known as an Open Agency Architecture, or OAA), if it turns out to be feasible. 1. Value is fragile and hard to specify.