
Rohin Shah
Articles
-
Jan 23, 2025 | lesswrong.com | Seb Farquhar | David H. Lindner | Rohin Shah
Blog post by Sebastian Farquhar, David Lindner, and Rohin Shah. It discusses the paper "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking" by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper uses Myopic Optimization with Non-myopic Approval (MONA) to try to make agents safer in ways that we may not be able to evaluate.
-
Jan 16, 2025 | businessage.com | Rohin Shah
After 35 years as a chartered surveyor navigating the global property market, travelling back and forth from the UK to India regularly and running a business based out of London, you could say I’ve been around the block (pun very much intended). But, most importantly, I have learnt a lot along the way. Firstly, to be successful at what you do, you have to love what you do. Passion is the foundation of everything. When you’re truly invested in your work, it doesn’t feel like work.
-
Oct 15, 2024 | lesswrong.com | Zach Stein-Perlman | Rohin Shah
Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy. I haven't yet formed an opinion on the key questions, including whether the thresholds and mitigations are reasonable and adequate. I'll edit this post later. Anthropic's new version of its RSP is here at last.
-
Aug 20, 2024 | lesswrong.com | Rohin Shah | Seb Farquhar | Anca Dragan
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems.
-
Mar 18, 2024 | lesswrong.com | Neel Nanda | János Kramár | Tom Lieberum | Rohin Shah
Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum. Tweet thread summary, paper. Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components.