Nicholas Schiefer

Featured in:

Articles

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training — LessWrong

Jan 12, 2024 | lesswrong.com | Carson Denison | David Kristjanson Duvenaud | Nicholas Schiefer | Ethan Perez

I'm not going to add a bunch of commentary here on top of what we've already put out, since we've put a lot of effort into the paper itself, and I'd mostly just recommend reading it directly, especially since there are a lot of subtle results that are not easy to summarize. I will say that I think this is some of the most important work I've ever done and I'm extremely excited for us to finally be able to share this.

Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research — LessWrong

Aug 8, 2023 | lesswrong.com | Ethan Perez | Sam Marks | Nicholas Schiefer | Carson Denison

Thanks, I think there is a confusing point here about how narrowly we define "model organisms." The OP defines sycophantic reward hacking as where a model obtains good performance during training (where it is carefully monitored), but it pursues undesirable reward hacks (like taking over the reward channel, aggressive power-seeking, etc.) during deployment or in domains where it operates with less careful or effective human monitoring.
