
Paul Colognese

Featured in: lesswrong.com

Articles

  • Mar 7, 2024 | lesswrong.com | Paul Colognese

    As part of an exchange being facilitated between religion and science, a group of academics has been asked to compile a short description of their greatest scientific achievement/discovery that will be translated into Tibetan and presented to Tibetan Buddhist scholars/monks. I was also invited to contribute, but I sort of ignored the instruction and decided to present an introduction to the AI Alignment Problem instead.

  • Mar 4, 2024 | lesswrong.com | Paul Colognese

    Thanks to Johannes Treutlein, Erik Jenner, Joseph Bloom, and Arun Jose for their discussions and feedback. Monitoring an AI’s internals for features/concepts unrelated to the task the AI appears to be performing may help detect when the AI is performing hidden cognition. For example, it would be very suspicious if an AI tasked with booking a holiday for a user was found to be using internal features/concepts related to “bioweapons” or “viruses”.

  • Feb 26, 2024 | lesswrong.com | Paul Colognese | Gerald M. Monroe

    Thanks to Johannes Treutlein for discussions and feedback. An AI may be able to hide cognition that leads to negative outcomes from certain oversight processes (such as deceptive alignment/scheming). Without being able to detect this hidden cognition, an overseer may not be able to prevent the associated negative outcomes or include this information as part of the training signal.

  • Feb 22, 2024 | lesswrong.com | Paul Colognese

    Thanks to Jeremy Gillen and Arun Jose for discussions related to these ideas. WARNING: The quality of this post is low. It was sitting in my drafts folder for a while, yet I decided to post it because some people found these examples and analyses helpful in conversations. I tidied up the summary, deleted some sections, and added warnings related to parts of the post that could be confusing.

  • Oct 30, 2023 | lesswrong.com | Mateusz Bagiński | Lucius Bushnaq | Paul Colognese | Roman Leventov

    The following is a transcript of a public discussion between Charbel-Raphaël Segerie and Lucius Bushnaq that took place on 23 September during LessWrong Community Weekend 2023. I edited the transcript for clarity. Transcript — Mateusz: Last month, Charbel published the post Against Almost Every Theory of Impact of Interpretability, which sparked a lot of discussion in the community. Lucius is an AI notkilleveryoneism researcher at Apollo, focused on interpretability. He disagrees with Charbel.
