
Kshitij Sachan
Articles
-
Dec 13, 2023 |
lesswrong.com | Fabien Roger | Kshitij Sachan
We’ve released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post, we summarize the paper and compare our methodology to that of other safety papers.
-
Aug 10, 2023 |
lesswrong.com | Kshitij Sachan | Jacob Pfau | Quintin Pope | Violet Hour
This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Thanks to Ryan Greenblatt, Fabien Roger, and Jenny Nitishinskaya for running some of the initial experiments and to Gabe Wu and Max Nadeau for revising this post. I conducted experiments to see if language models could use 'filler tokens'—unrelated text output before the final answer—for additional computation.
-
Mar 15, 2023 |
lesswrong.com | Sam Bowman | Kshitij Sachan | Roman Leventov | Zhu Xiaohu
My summary: Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Detecting alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with.