Zac Hatfield-Dodds

Featured in: themandarin.com.au

Articles

  • Oct 15, 2024 | lesswrong.com | Zac Hatfield-Dodds

    Today we are publishing a significant update to our Responsible Scaling Policy (RSP), the risk governance framework we use to mitigate potential catastrophic risks from frontier AI systems. This update introduces a more flexible and nuanced approach to assessing and managing AI risks while maintaining our commitment not to train or deploy models unless we have implemented adequate safeguards.

  • Jun 3, 2024 | peps.python.org | Zac Hatfield-Dodds | Nathaniel Smith

    No-yield example: In this example, we see multiple rounds of the stack merging as we unwind from sys.prevent_yields, through the user-defined ContextManager, back to the original Frame. For brevity, the reason for preventing yields is not shown; it is part of the "1 enter" state. With no yield we don't raise any errors, and because the number of enters and exits balance, the frame returns as usual with no further tracking. (A toy model of this enter/exit balancing appears after the article list below.)

  • May 20, 2024 | lesswrong.com | Zac Hatfield-Dodds

    Last September we published our first Responsible Scaling Policy (RSP) [LW discussion], which focuses on addressing catastrophic safety failures and misuse of frontier models. In adopting this policy, our primary goal is to help turn high-level safety concepts into practical guidelines for fast-moving technical organizations and demonstrate their viability as possible standards. As we operationalize the policy, we expect to learn a great deal and plan to share our findings.

  • Apr 23, 2024 | lesswrong.com | Carson Denison | Zac Hatfield-Dodds

    This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future. Top-level summary: In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. (A minimal probe sketch follows after the article list below.)

  • Mar 25, 2024 | lesswrong.com | Zac Hatfield-Dodds

    (nb: this post is written for anyone interested, not specifically aimed at this forum)We believe that the AI sector needs effective third-party testing for frontier AI systems. Developing a testing regime and associated policy interventions based on the insights of industry, government, and academia is the best way to avoid societal harm—whether deliberate or accidental—from AI systems.
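
As a companion to the Jun 3 PEP excerpt above, here is a minimal pure-Python sketch of the enter/exit balancing it describes. This is an illustrative model only, assuming a global counter as a stand-in for the per-frame state the PEP would track at the interpreter level; prevent_yields, user_defined_cm, and no_yield_example are hypothetical names for this sketch, not the PEP's reference implementation.

```python
import contextlib

_enters = 0  # stand-in for the frame's "N enters" state; the reason is omitted for brevity

@contextlib.contextmanager
def prevent_yields(reason):
    # Hypothetical emulation of the proposed sys.prevent_yields: each enter
    # must be balanced by an exit before the frame returns normally.
    global _enters
    _enters += 1      # the "1 enter" state, which also carries the reason
    try:
        yield
    finally:
        _enters -= 1  # the matching exit: counts balance, no further tracking

@contextlib.contextmanager
def user_defined_cm():
    # A user-defined context manager layered over prevent_yields; on exit,
    # both scopes unwind ("merge") back into the original calling frame.
    with prevent_yields("demo"):
        yield

def no_yield_example():
    with user_defined_cm():
        return "done"  # no yield occurs here, so no error is raised

assert no_yield_example() == "done" and _enters == 0
```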
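And as a companion to the Apr 23 defection-probes excerpt, a minimal sketch of the core technique: fitting a linear classifier on residual-stream activations to predict defection. All data here is synthetic (toy dimensions, fabricated activations shifted along a made-up "defection" direction); real probes would use activations extracted from the sleeper-agent model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 1_000, 128  # toy sizes, not a real model's residual width

# Fabricated "activations": defecting samples are shifted along a hidden direction.
direction = rng.normal(size=n_features)
labels = rng.integers(0, 2, size=n_samples)  # 1 = model will defect
acts = rng.normal(size=(n_samples, n_features)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(acts, labels, random_state=0)

# The probe itself is just a linear classifier over the activation vectors.
probe = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.2f}")
```

Logistic regression is one common choice of linear probe; the post's classifiers are likewise linear functions of the residual stream.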
