Seb Farquhar

Articles

MONA: Managed Myopia with Approval Feedback — LessWrong

Jan 23, 2025 | lesswrong.com | Seb Farquhar | David H. Lindner | Rohin Shah

Blog post by Sebastian Farquhar, David Lindner, Rohin Shah. It discusses the paper MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper tries to make agents that are safer in ways that we may not be able to evaluate through Myopic Optimization with Non-myopic Approval (MONA).

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work — LessWrong

Aug 20, 2024 | lesswrong.com | Rohin Shah | Seb Farquhar | Anca Dragan

We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems.

Discussion: Challenges with Unsupervised LLM Knowledge Discovery — LessWrong

Dec 18, 2023 | lesswrong.com | Vikrant Varma | Vlad Mikulik | Rohin Shah | Seb Farquhar

TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.


