Zhu Xiaohu

Featured in:

Articles

Towards understanding-based safety evaluations — LessWrong

Mar 15, 2023 | lesswrong.com | Sam Bowman | Kshitij Sachan | Roman Leventov | Zhu Xiaohu

My summary: Evan expresses pessimism about our ability to use behavioral evaluations (like the capabilities evals ARC did for GPT-4) to test for alignment properties in the future. Detecting alignment may be quite hard because you might be up against a highly capable adversary that is trying to evade detection; this might even be harder than training an aligned system to begin with.
