
Lennart Buerger
Articles
-
Aug 22, 2024 | lesswrong.com | Winnie Yang | Jojo Yang | Lennart Buerger
As models grow increasingly sophisticated, they will surpass human expertise. Ensuring that such models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we might hope to reliably detect whether a model is behaving deceptively in pursuit of an instrumental goal. Importantly, deceptive alignment and robust alignment are behaviorally indistinguishable (Hubinger et al., 2024).
-
Jul 19, 2024 | lesswrong.com | Lennart Buerger
A short summary of the paper is presented below. TL;DR: We develop a robust method to detect when an LLM is lying based on the model's internal activations, making the following contributions: (i) We demonstrate the existence of a two-dimensional subspace along which the activation vectors of true and false statements can be separated. Notably, this finding is universal and holds across various LLMs, including Gemma-7B, LLaMA2-13B, and LLaMA3-8B.
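The idea of a low-dimensional subspace separating true from false statements can be illustrated with a minimal sketch. The code below uses synthetic activation vectors, not real LLM hidden states, and the two directions (one for truth value, one for statement polarity) are illustrative assumptions standing in for the structure the paper reports; the actual method extracts activations from a specific layer of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 100  # hidden dimension and statements per (truth, polarity) cell

# Hypothetical stand-in for an LLM's hidden-state activations: one
# direction carries the truth value, another the statement polarity
# (affirmative vs. negated), mimicking the 2-D structure described above.
t_dir = np.zeros(d); t_dir[0] = 3.0  # assumed "truth" direction
p_dir = np.zeros(d); p_dir[1] = 2.0  # assumed "polarity" direction

X, y = [], []
for truth in (+1, -1):
    for pol in (+1, -1):
        X.append(truth * t_dir + pol * p_dir + rng.normal(size=(n, d)))
        y.extend([truth > 0] * n)
X = np.vstack(X)
y = np.array(y)

# Recover a 2-D subspace via the top-2 principal components of the
# centred activations, then separate true from false inside it with a
# simple mean-difference linear probe.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T  # activations projected onto the 2-D subspace

w = Z[y].mean(axis=0) - Z[~y].mean(axis=0)   # direction between class means
scores = Z @ w
thresh = (scores[y].mean() + scores[~y].mean()) / 2
acc = ((scores > thresh) == y).mean()
print(f"true/false separation accuracy in the 2-D subspace: {acc:.2f}")
```

On this synthetic data the two classes separate almost perfectly inside the recovered subspace; with real activations the paper's result is that such a separating subspace exists and generalizes across models.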