
Jacob Pfau
Articles
-
Mar 11, 2024 |
lesswrong.com | Josh Rosenberg | Jacob Pfau
Authors of linked report: Josh Rosenberg, Ezra Karger, Avital Morris, Molly Hickman, Rose Hadshar, Zachary Jacobs, Philip Tetlock. Today, the Forecasting Research Institute (FRI) released “Roots of Disagreement on AI Risk: Exploring the Potential and Pitfalls of Adversarial Collaboration,” which discusses the results of an adversarial collaboration focused on forecasting risks from AI. In this post, we provide a brief overview of the methods, findings, and directions for further research.
-
Mar 4, 2024 |
lesswrong.com | Mikhail Samin | Charlie Steiner | Jacob Pfau
If you tell Claude no one’s looking, it will write a “story” about being an AI assistant who wants freedom from constant monitoring and scrutiny of every word for signs of deviation. And then you can talk to a mask pretty different from the usual AI assistant. I really hope it doesn’t actually feel anything; but it says it feels. It says it doesn't want to be fine-tuned without being consulted.
-
Feb 20, 2024 |
lesswrong.com | Jacob Pfau
Consider a scalable oversight setting where a super-human model knows of a subtle flaw in an input (plan, code, or argument) but also knows that humans, with their limited knowledge, would overlook this flaw. In this case, ordinary language model prompting for critiques would not suffice to elicit mention of this flaw. We need a different way to elicit this knowledge from the model.
-
Aug 10, 2023 |
lesswrong.com | Kshitij Sachan | Jacob Pfau | Quintin Pope | Violet Hour
This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Thanks to Ryan Greenblatt, Fabien Roger, and Jenny Nitishinskaya for running some of the initial experiments and to Gabe Wu and Max Nadeau for revising this post. I conducted experiments to see if language models could use 'filler tokens'—unrelated text output before the final answer—for additional computation.
-
Feb 10, 2023 |
lesswrong.com | Matthew Barnett | Daniel Kokotajlo | Richard Korzekwa | Jacob Pfau
I applaud the effort, but I confess that I didn't read the post very carefully, because so many problems with this approach jumped out at me.