
Adversarial Training
Featured in:
Articles
- Feb 10, 2024 | theinsideview.ai | Michaël Trazzi | Adversarial Training
2024-02-11: Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published “Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training”. In this interview, we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on alignment stress-testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.