Dmitrii Kharlapenko

Featured in:

Articles

  • Nov 14, 2024 | lesswrong.com | Daniel Tan |Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

    TLDR: Fluent dreaming for language models is an algorithm, based on the GCG method, that uses gradients and genetic-algorithm search to reliably find plain-text, human-readable prompts that maximize chosen logits or residual-stream directions. The original authors used it to visualize MLP neurons; we show the method can also help interpret SAE features.

  • Oct 12, 2024 | lesswrong.com | Dmitrii Kharlapenko |Arthur Conmy |Neel Nanda

    Steering vectors provide evidence that linear directions in LLMs are interpretable. Since SAEs decompose linear directions, they should be able to interpret steering vectors. We apply the gradient pursuit algorithm to decompose steering vectors, and find that they contain many interpretable and promising-looking features. This builds on our prior work, which applied ITO and its derivatives to steering vectors with less success.

  • Aug 12, 2024 | lesswrong.com | Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

    We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream into a linear combination of SAE features — you can't just pass an arbitrary vector through the encoder without going off distribution. We explored the gradient pursuit algorithm, but it didn't work for us without modifications. Our approach is to apply the SAE encoder to the task vector and then apply a gradient-based cleanup.

  • Aug 5, 2024 | lesswrong.com | Dmitrii Kharlapenko |Neel Nanda |Arthur Conmy

    We apply the method of SelfIE/Patchscopes to explain SAE features: we give the model a prompt like “What does X mean?”, replace the residual stream on X with the decoder direction times some scale, and have it generate an explanation. We call this self-explanation. The natural alternative is auto-interp, which uses a larger LLM to spot patterns in maximally activating examples.
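
To make the fluent-dreaming summary above concrete, here is a toy sketch of one greedy coordinate (GCG-style) step. A made-up linear "score" stands in for a real LLM logit, and the embedding table, target direction, and candidate-selection shortcut are all assumptions for illustration, not the published algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, L = 50, 8, 5            # vocab size, embedding dim, prompt length
E = rng.normal(size=(V, D))   # toy token embedding table
target = rng.normal(size=D)   # residual-stream direction to maximize

def score(tokens):
    # stand-in objective: mean token embedding projected onto `target`
    return float(E[tokens].mean(axis=0) @ target)

def gcg_step(tokens, k=5):
    # In real GCG, gradients w.r.t. each one-hot token propose top-k swaps;
    # for this linear toy objective that gradient is simply E @ target.
    candidates = np.argsort(-(E @ target))[:k]
    best, best_s = tokens, score(tokens)
    for pos in range(len(tokens)):
        for tok in candidates:
            trial = tokens.copy()
            trial[pos] = tok
            if (s := score(trial)) > best_s:
                best, best_s = trial, s
    return best, best_s

tokens = rng.integers(0, V, size=L)
s0 = score(tokens)
for _ in range(10):
    tokens, s = gcg_step(tokens)
```

Because each step keeps the current prompt among its candidates, the score never decreases; a real implementation alternates such swap proposals with fluency constraints to keep prompts readable.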
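
The gradient pursuit algorithm mentioned in the steering-vector summary can be sketched as follows. The dictionary here is a random unit-norm stand-in for an SAE decoder, and the greedy atom selection plus exact line search follow the standard gradient pursuit recipe, not the authors' exact implementation:

```python
import numpy as np

def gradient_pursuit(x, D, n_steps=10):
    """Greedy sparse decomposition of x over dictionary D (atoms as rows).

    Each step activates the atom most correlated with the residual, then
    takes an exact line-search gradient step on the active coefficients.
    """
    coeffs = np.zeros(D.shape[0])
    active = np.zeros(D.shape[0], dtype=bool)
    for _ in range(n_steps):
        residual = x - coeffs @ D
        corr = D @ residual
        active[np.argmax(np.abs(corr))] = True
        grad = corr * active            # gradient restricted to active set
        step_dir = grad @ D             # image of the gradient in data space
        denom = step_dir @ step_dir
        if denom < 1e-12:
            break
        alpha = (residual @ step_dir) / denom   # exact line search
        coeffs += alpha * grad
    return coeffs

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 32))
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm "decoder" atoms
true = np.zeros(100)
true[[3, 17, 42]] = [2.0, -1.5, 1.0]            # a 3-sparse ground truth
x = true @ D
c = gradient_pursuit(x, D, n_steps=15)
```

Each iteration strictly shrinks the residual, so after a few steps `c` concentrates weight on a few atoms that reconstruct `x`, which is the sense in which a steering vector gets read off as a short list of SAE features.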
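
The encode-then-cleanup approach from the task-vector summary can be sketched like this. The SAE weights below are random stand-ins (not a trained SAE), and "cleanup" is implemented as plain projected gradient descent on the reconstruction error, which is one simple reading of "gradient-based cleanup":

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 64
W_enc = rng.normal(size=(d, k)) / np.sqrt(d)   # toy encoder weights
W_dec = rng.normal(size=(k, d)) / np.sqrt(k)   # toy decoder weights
b_enc = np.zeros(k)

def encode(v):
    # toy SAE encoder: ReLU(v @ W_enc + b_enc)
    return np.maximum(v @ W_enc + b_enc, 0.0)

def cleanup(v, acts, lr=0.05, n_steps=200):
    # projected gradient descent on ||v - acts @ W_dec||^2,
    # keeping feature activations non-negative
    for _ in range(n_steps):
        grad = (acts @ W_dec - v) @ W_dec.T
        acts = np.maximum(acts - lr * grad, 0.0)
    return acts

task_vec = rng.normal(size=d)
raw = encode(task_vec)          # naive encoder pass (off-distribution)
clean = cleanup(task_vec, raw)  # refine activations to fit the vector
```

The point of the second stage is that the raw encoder pass is unreliable on an arbitrary vector, whereas directly optimizing the activations against the decoder reconstructs it at least as well.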
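
The patching step in the self-explanation summary (overwrite the residual stream at token X with a scaled decoder direction) can be sketched with stand-in layer functions; the hook placement, the do-nothing layers, and the function names are all assumptions for illustration:

```python
import numpy as np

def run_with_patch(layers, resid, patch_pos, direction, scale):
    """Run a toy residual-stream model, overwriting one position's state.

    `layers` is a list of functions mapping the residual stream to an
    additive update; after layer 0 the residual stream at `patch_pos` is
    replaced with `scale * direction`, mimicking the SelfIE/Patchscopes
    substitution on the placeholder token X.
    """
    for i, layer in enumerate(layers):
        resid = resid + layer(resid)
        if i == 0:
            resid[patch_pos] = scale * direction
    return resid

rng = np.random.default_rng(0)
seq, d = 6, 8
resid0 = rng.normal(size=(seq, d))
direction = rng.normal(size=d)               # stand-in SAE decoder direction
layers = [lambda r: np.zeros_like(r)] * 3    # do-nothing stand-in layers
out = run_with_patch(layers, resid0.copy(), patch_pos=2,
                     direction=direction, scale=4.0)
```

In a real model the later layers then process the injected direction in context, and the generated continuation to "What does X mean?" serves as the feature's explanation; the `scale` knob matters because decoder directions are much smaller than typical residual-stream norms.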
