
Paul Colognese

Featured in: lesswrong.com

Articles

  • Mar 7, 2024 | lesswrong.com | Paul Colognese

    As part of an exchange being facilitated between religion and science, a group of academics has been asked to compile a short description of their greatest scientific achievement/discovery that will be translated into Tibetan and presented to Tibetan Buddhist scholars/monks. I was also invited to contribute, but I sort of ignored the instruction and decided to present an introduction to the AI Alignment Problem instead.

  • Mar 4, 2024 | lesswrong.com | Paul Colognese

    Thanks to Johannes Treutlein, Erik Jenner, Joseph Bloom, and Arun Jose for their discussions and feedback. Monitoring an AI’s internals for features/concepts unrelated to the task the AI appears to be performing may help detect when the AI is performing hidden cognition. For example, it would be very suspicious if an AI tasked with booking a holiday for a user was found to be using internal features/concepts related to “bioweapons” or “viruses”.

  • Feb 26, 2024 | lesswrong.com | Paul Colognese | Gerald M. Monroe

    Thanks to Johannes Treutlein for discussions and feedback. An AI may be able to hide cognition that leads to negative outcomes from certain oversight processes (such as deceptive alignment/scheming). Without being able to detect this hidden cognition, an overseer may not be able to prevent the associated negative outcomes or include this information as part of the training signal.

  • Feb 22, 2024 | lesswrong.com | Paul Colognese

    Thanks to Jeremy Gillen and Arun Jose for discussions related to these ideas. WARNING: The quality of this post is low. It was sitting in my drafts folder for a while, yet I decided to post it because some people found these examples and analyses helpful in conversations. I tidied up the summary, deleted some sections, and added warnings related to parts of the post that could be confusing.

  • Oct 30, 2023 | lesswrong.com | Mateusz Bagiński | Lucius Bushnaq | Paul Colognese | Roman Leventov

    The following is a transcript of a public discussion between Charbel-Raphaël Segerie and Lucius Bushnaq that took place on 23 September during LessWrong Community Weekend 2023. I edited the transcript for clarity. Transcript — Mateusz: Last month, Charbel published the post Against Almost Every Theory of Impact of Interpretability, which sparked a lot of discussion in the community. Lucius is an AI notkilleveryoneism researcher at Apollo, focused on interpretability. He disagrees with Charbel.
