
Joseph Miller
Articles
-
Jan 8, 2025 |
lesswrong.com | Lucius Bushnaq | Daniel Murfet | Joseph Miller | Matthew Clarke
TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: it seems likely to find features of the activations (features that help explain the statistical structure of the activation space) rather than features of the model (the features the model's own computations actually make use of).
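A minimal sketch (my own toy example, not code from the post) of what "decomposing an individual activation space in isolation" can look like: factoring activations collected at one layer, here with PCA on synthetic data. The resulting directions summarise the statistics of the activations, but nothing in the procedure guarantees they correspond to features the model's downstream computation reads off, which is the worry the post raises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for activations at one residual-stream position: (n_samples, d_model).
activations = rng.normal(size=(10_000, 64))

# Centre the data and take the top principal directions of the activation distribution.
centered = activations - activations.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

top_k = 8
activation_features = components[:top_k]  # directions in activation space
explained = (singular_values[:top_k] ** 2) / (singular_values ** 2).sum()

print("variance explained by top directions:", explained.round(3))
# These directions explain variance in the activations at this layer; whether the
# model's own weights treat them as meaningful units is a separate question.
```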
-
Jul 12, 2024 |
lesswrong.com | Joseph Miller
When you think you've found a circuit in a language model, how do you know it does what you think it does? Typically, you ablate or resample the model's activations to isolate the circuit, then measure whether the model can still perform the task you're investigating. We identify six ways in which ablation experiments often vary. How do these variations change the results of experiments that measure circuit faithfulness?
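A minimal sketch, under assumptions of my own, of one common faithfulness check: resample-ablating the activations outside a hypothesised circuit and seeing whether task performance survives. The toy model, hooked layer, and circuit mask below are hypothetical placeholders, not taken from the post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

# Pretend the hypothesised "circuit" uses only the first 8 hidden units of layer 0.
circuit_mask = torch.zeros(32, dtype=torch.bool)
circuit_mask[:8] = True

clean_batch = torch.randn(64, 16)    # inputs for the task being studied
corrupt_batch = torch.randn(64, 16)  # inputs used as the resample source

# Cache activations at the hooked layer from the corrupt run.
cache = {}
def cache_hook(module, inputs, output):
    cache["acts"] = output.detach()

# During the patched run, replace non-circuit activations with the cached ones.
def resample_hook(module, inputs, output):
    patched = output.clone()
    patched[:, ~circuit_mask] = cache["acts"][:, ~circuit_mask]
    return patched

layer = model[0]

handle = layer.register_forward_hook(cache_hook)
model(corrupt_batch)
handle.remove()

handle = layer.register_forward_hook(resample_hook)
patched_logits = model(clean_batch)
handle.remove()

clean_logits = model(clean_batch)
# A faithfulness metric compares patched and clean outputs on the task, e.g. via
# logit difference or KL divergence; that choice is one of the variations at issue.
print((patched_logits - clean_logits).abs().mean().item())
```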
-
May 21, 2024 |
lesswrong.com | Joseph Miller
This is a special post for quick takes by Joseph Miller. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
-
May 11, 2024 |
lesswrong.com | Joseph Miller
This post outlines an efficient implementation of Edge Patching that massively outperforms common hook-based implementations. This implementation is available to use in my new library, AutoCircuit, and was first introduced by Li et al. (2023). I introduce new terminology to clarify the distinction between different types of activation patching.
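A minimal sketch (my own toy setup, not the AutoCircuit implementation) of the idea behind efficient edge patching: when downstream components read a sum of upstream outputs, as in a transformer's residual stream, patching the edge A -> B only requires adding the difference between A's corrupt and clean outputs to B's input, without hooking each edge separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 8
comp_a = nn.Linear(d, d)  # upstream component A
comp_c = nn.Linear(d, d)  # another upstream component C
comp_b = nn.Linear(d, d)  # downstream component B, reading the residual stream

def upstream(x):
    # Toy "residual stream": the input plus the outputs of A and C.
    return x + comp_a(x) + comp_c(x), comp_a(x)

clean_x = torch.randn(4, d)
corrupt_x = torch.randn(4, d)

clean_stream, a_clean = upstream(clean_x)
_, a_corrupt = upstream(corrupt_x)

# Node patching would replace the entire stream B sees (all incoming edges at once).
# Edge patching A -> B shifts only A's contribution within B's input.
edge_patched_input = clean_stream + (a_corrupt - a_clean)

b_clean = comp_b(clean_stream)
b_edge_patched = comp_b(edge_patched_input)
print((b_edge_patched - b_clean).abs().mean().item())
```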
-
Apr 30, 2024 |
lesswrong.com | Joseph Miller
GPT-5 training is probably starting around now. It seems very unlikely that GPT-5 will cause the end of the world. But it’s hard to be sure. I would guess that GPT-5 is more likely to kill me than an asteroid, a supervolcano, a plane crash or a brain tumor. We can predict fairly well what the cross-entropy loss will be, but pretty much nothing else. Maybe we will suddenly discover that the difference between GPT-4 and superhuman level is actually quite small.