Fabien Roger

Articles

A toy evaluation of inference code tampering — LessWrong

Dec 9, 2024 | lesswrong.com | Fabien Roger

Work done with James Faina, Evan Hubinger and Ethan Perez. This is ongoing research on speculative threats, which might be of interest to researchers in the field of AI safety. We ask that readers treat this as a work in progress, rather than as a mature research paper. To prevent highly capable and potentially misaligned LLMs from taking bad actions, we might want to monitor all of their outputs (Inan et al. 2023, Greenblatt 2024, Balesni 2024).
The case for unlearning that removes information from LLM weights — LessWrong

Oct 14, 2024 | lesswrong.com | Fabien Roger

What if you could remove some information from the weights of an AI? Would that be helpful? It is clearly useful against some misuse concerns: if you are concerned that LLMs will make it easier to build bioweapons because they have memorized such information, removing the memorized facts would remove this misuse concern.
Is cybercrime really costing trillions per year? — LessWrong

Sep 27, 2024 | lesswrong.com | Fabien Roger

Many sources report that cybercrime costs the global economy trillions of dollars per year. It is the top Google search result and it is quoted on Wikipedia. But I am not able to track down how the number was computed, or find criticism of these numbers. This would be insanely high if true: the world GDP is only 100 trillion / year, and the software industry is only around 1 trillion / year (according to a quick Google search).
An issue with training schemers with supervised fine-tuning — LessWrong

Jun 27, 2024 | lesswrong.com | Fabien Roger

This is a rough write-up of ideas by Ryan Greenblatt. The core ideas are his and mistakes are mine. Thanks to him, Buck Shlegeris and to Aghyad Deeb for feedback on the draft of this post. One way to use scheming AIs is to train them to imitate humans. When does imitation robustly avoid frequent and intentional failures?
Best-of-n with misaligned reward models for Math reasoning — LessWrong

Jun 21, 2024 | lesswrong.com | Fabien Roger

In this post, I share some quick results that I got as part of a bigger project that pivoted. TL;DR: When trying to generate Math reasoning with current LLMs, optimizing too hard against their judgment of what is “good reasoning” can make the proportion of correct reasoning diminish.
Memorizing weak examples can elicit strong behavior out of password-locked models — LessWrong

Jun 6, 2024 | lesswrong.com | Fabien Roger

We’ve recently done some research looking into sandbagging: examining when models can succeed at intentionally producing low-quality outputs despite attempts at fine-tuning them to perform well. One reason sandbagging could be concerning is that scheming models might try to appear less capable than they actually are in capability evaluations, which could result in these models being deployed with insufficient countermeasures.
[Paper] Stress-testing capability elicitation by training password-locked models — LessWrong

Jun 4, 2024 | lesswrong.com | Fabien Roger

The paper is by Ryan Greenblatt, Fabien Roger, Dmitrii Krasheninnikov and David Krueger. This post was written by Fabien and Ryan, and may not reflect the views of Dmitrii and David. Scheming models might try to perform less capably than they are able to (sandbag). They might do this to make evaluations underestimate danger, slow down promising safety work, or cause issues with debate or recursive oversight more generally.
Open consultancy: Letting untrusted AIs choose what answer to argue for — LessWrong

Mar 12, 2024 | lesswrong.com | Fabien Roger

Hey, this is cool work. I think I disagree with a couple of these comments, coming from a Scalable Oversight background rather than an AI control one, and what weak/strong gaps we intend to look over. My main contention is the framing that you can talk about Consultants producing arguments with convincingness scores $c_i$ and $c_c$, which are independent of the protocol and judge that you are using. I try to break this down into the two claims.
Notes on control evaluations for safety cases — LessWrong

Feb 28, 2024 | lesswrong.com | Fabien Roger

The quality bar of this post is somewhat lower than that of our previous posts on control and this post is much more focused on specific details. This is particularly true about the appendices of this post. So, we only recommend reading for those who are quite interested in getting into the details of control evaluations. How can we evaluate the safety of a plan for training and using powerful AIs?
AI Control: Improving Safety Despite Intentional Subversion — LessWrong

Dec 13, 2023 | lesswrong.com | Fabien Roger | Kshitij Sachan

We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post: we summarize the paper; we compare our methodology to the methodology of other safety papers.