
Nicholas Schiefer
Articles
-
Jan 12, 2024 |
lesswrong.com | Carson Denison | David Kristjanson Duvenaud | Nicholas Schiefer | Ethan Perez
I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this.
-
Aug 8, 2023 |
lesswrong.com | Ethan Perez | Sam Marks | Nicholas Schiefer | Carson Denison
Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.