János Kramár

Featured in:

Articles

AtP*: An efficient and scalable method for localizing LLM behaviour to components — LessWrong

Mar 18, 2024 | lesswrong.com | Neel Nanda | János Kramár | Tom Lieberum | Rohin Shah

Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum. Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components.

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) — LessWrong

Dec 23, 2023 | lesswrong.com | Neel Nanda | János Kramár | Rohin Shah

This is great work, even though you weren't able to understand the memorization mechanistically. I agree that a big part of the reason to be pessimistic about ambitious mechanistic interp is that even very large neural networks are performing some amount of pure memorization. For example, existing LMs often can regurgitate some canary strings, which really seems like a case without any (to use your phrase) macrofeatures.
