
Jojo Yang
Articles
-
Aug 22, 2024 | lesswrong.com | Winnie Yang | Jojo Yang | Lennart Buerger
As models grow increasingly sophisticated, they will surpass human expertise. Ensuring that such models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we would like to know reliably whether a model is being deceptive in order to achieve an instrumental goal. Importantly, deceptive alignment and robust alignment are behaviorally indistinguishable (Hubinger et al., 2024).
-
Aug 22, 2024 | lesswrong.com | Winnie Yang | Jojo Yang
As models grow increasingly sophisticated, they will surpass human expertise. Ensuring that such models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we would like to know reliably whether a model is being deceptive in order to achieve an instrumental goal (Hubinger et al., 2024). Importantly, deceptive alignment and robust alignment are behaviorally indistinguishable (Hubinger et al., 2024).