Vikrant Varma

Articles

  • Jul 19, 2024 | lesswrong.com | Neel Nanda | Senthooran Rajamanoharan | Tom Lieberum | Vikrant Varma

    New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan! We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability.

  • Dec 18, 2023 | lesswrong.com | Vikrant Varma | Vlad Mikulik | Rohin Shah | Seb Farquhar

    TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won’t be directly helpful either (70%). We’ve written a paper about some of our detailed experiences with it.

  • Sep 8, 2023 | lesswrong.com | Vikrant Varma | Rohin Shah

    Unless by "shrugs" you mean the details of what the partial hypothesis says in this particular case are still being worked out. Yes, that's what I mean. I do agree that it's useful to know whether a partial hypothesis says anything or not; overall I think this is good info to know / ask for. I think I came off as disagreeing more strongly than I actually did, sorry about that. Do you have any plans to do this?
