
Dan Braun
Articles
-
2 months ago |
lesswrong.com | Lucius Bushnaq | Dan Braun | Stefan Hex | Lee Sharkey
This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. At Apollo, we've spent a lot of time thinking about how the computations of neural networks might be structured, and how those computations might be embedded in networks' parameters.
-
Oct 5, 2024 |
lesswrong.com | Dan Braun
In which worlds would AI Control prevent significant harm? When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.
-
Jul 18, 2024 |
lesswrong.com | Lee Sharkey | Lucius Bushnaq | Dan Braun | Stefan Hex
Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about! Previous lists of project ideas (such as Neel’s collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field.
-
May 29, 2024 |
lesswrong.com | Marius Hobbhahn | Lee Sharkey | Lucius Bushnaq | Dan Braun
This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research. Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old. For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure.
-
May 20, 2024 |
lesswrong.com | Marius Hobbhahn | Lucius Bushnaq | Dan Braun | Nicholas Goldowsky-Dill
This is a linkpost for our two recent papers, produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu: "An exploration of using degeneracy in the loss landscape for interpretability" (https://arxiv.org/abs/2405.10927) and "An empirical test of an interpretability technique based on the loss landscape" (https://arxiv.org/abs/2405.10928). Not to be confused with Apollo's recent Sparse Dictionary Learning paper.