
János Kramár
Articles
-
Mar 18, 2024 |
lesswrong.com | Neel Nanda | János Kramár | Tom Lieberum | Rohin Shah
Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda. A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum. Tweet thread summary, paper. Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components.
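To make the abstract's definition concrete, here is a minimal sketch of activation patching on a toy two-layer network. It is an illustration only, not the paper's code or method: the model, the weights, and every name (`hidden_acts`, `readout`, `activation_patching`) are hypothetical. The idea shown is the generic one: run the model on a corrupted input, overwrite one component's activation with its value from a clean run, and record how much the output moves.

```python
# Toy illustration of activation patching (all names and weights are hypothetical).

def relu(values):
    return [max(0.0, v) for v in values]

def hidden_acts(W1, x):
    # Hidden layer: one ReLU unit per row of W1.
    return relu([sum(w * xi for w, xi in zip(row, x)) for row in W1])

def readout(w2, h):
    # Scalar output: linear readout of the hidden activations.
    return sum(w * hi for w, hi in zip(w2, h))

def activation_patching(W1, w2, clean_x, corrupt_x):
    """For each hidden unit, patch its clean activation into the corrupted
    run and return the resulting change in the output."""
    clean_h = hidden_acts(W1, clean_x)
    corrupt_h = hidden_acts(W1, corrupt_x)
    baseline = readout(w2, corrupt_h)
    effects = []
    for i in range(len(corrupt_h)):
        patched = list(corrupt_h)
        patched[i] = clean_h[i]  # intervene on a single component
        effects.append(readout(w2, patched) - baseline)
    return effects

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [2.0, -1.0, 0.5]
clean_x = [1.0, 0.0]
corrupt_x = [0.0, 1.0]

effects = activation_patching(W1, w2, clean_x, corrupt_x)
print(effects)  # [2.0, 1.0, 0.0]
```

In this toy case the third unit has the same activation on both inputs, so patching it changes nothing, while the first unit accounts for the largest shift in the output. Note that the loop needs one forward pass per component, so the cost grows with the number of components being attributed.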
Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1) — LessWrong
Dec 23, 2023 |
lesswrong.com | Neel Nanda | János Kramár | Rohin Shah
This is great work, even though you weren't able to understand the memorization mechanistically. I agree that a big part of the reason to be pessimistic about ambitious mechanistic interp is that even very large neural networks are performing some amount of pure memorization. For example, existing LMs often can regurgitate some canary strings, which really seems like a case without any (to use your phrase) macrofeatures.