
Carson Denison
Articles
-
Jun 17, 2024 |
lesswrong.com | Carson Denison
Perverse incentives are everywhere. Think of the concept of "teaching to the test", where teachers focus on the narrow goal of exam preparation and fail to give their students a broader education. Or think of scientists working in the "publish or perish" academic system, publishing large numbers of low-quality papers to advance their careers at the expense of what we actually want them to produce: rigorous research.
-
Apr 23, 2024 |
lesswrong.com | Carson Denison | Zac Hatfield-Dodds
This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Top-level summary: In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal.
-
Jan 12, 2024 |
lesswrong.com | Carson Denison | David Kristjanson Duvenaud | Nicholas Schiefer | Ethan Perez
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this.
-
Aug 8, 2023 |
lesswrong.com | Ethan Perez | Sam Marks | Nicholas Schiefer | Carson Denison
Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.