
Seb Farquhar
Articles
-
Jan 23, 2025 |
lesswrong.com | Seb Farquhar | David H. Lindner | Rohin Shah
Blog post by Sebastian Farquhar, David Lindner, Rohin Shah. It discusses the paper "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking" by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper uses Myopic Optimization with Non-myopic Approval (MONA) to try to make agents safer in ways that we may not be able to evaluate.
-
Aug 20, 2024 |
lesswrong.com | Rohin Shah | Seb Farquhar | Anca Dragan
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems.
-
Dec 18, 2023 |
lesswrong.com | Vikrant Varma | Vlad Mikulik | Rohin Shah | Seb Farquhar
TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.