Adversarial Training

Featured in:

Articles

Evan Hubinger on training deceptive LLMs that persist through safety training

Feb 10, 2024 | theinsideview.ai | Michaël Trazzi | Adversarial Training

Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”. This interview mostly covers the Sleeper Agents paper, but also how that line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.
