
Rohin Shah
Articles
-
Jan 23, 2025 | lesswrong.com | Seb Farquhar | David H. Lindner | Rohin Shah
Blog post by Sebastian Farquhar, David Lindner, and Rohin Shah. It discusses the paper "MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking" by Sebastian Farquhar, Vikrant Varma, David Lindner, David Elson, Caleb Biddulph, Ian Goodfellow, and Rohin Shah. Our paper uses Myopic Optimization with Non-myopic Approval (MONA) to try to make agents safer in ways that we may not be able to evaluate.
-
Jan 16, 2025 | businessage.com | Rohin Shah
After 35 years as a chartered surveyor navigating the global property market, travelling back and forth from the UK to India regularly and running a business based out of London, you could say I’ve been around the block (pun very much intended). But, most importantly, I have learnt a lot along the way. Firstly, to be successful at what you do, you have to love what you do. Passion is the foundation of everything. When you’re truly invested in your work, it doesn’t feel like work.
-
Oct 15, 2024 | lesswrong.com | Zach Stein-Perlman | Rohin Shah
Canonical linkpost: https://www.lesswrong.com/posts/Q7caj7emnwWBxLECF/anthropic-s-updated-responsible-scaling-policy. I haven't yet formed an opinion on the key questions, including whether the thresholds and mitigations are reasonable and adequate. I'll edit this post later. Anthropic's new version of its RSP is here at last.
-
Aug 20, 2024 | lesswrong.com | Rohin Shah | Seb Farquhar | Anca Dragan
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. We’re the main team at Google DeepMind working on technical approaches to existential risk from AI systems.
-
Mar 18, 2024 | lesswrong.com | Neel Nanda | János Kramár | Tom Lieberum | Rohin Shah
Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum. Tweet thread summary, paper. Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components.