
Nina Rimsky


Articles

  • Jun 20, 2024 | lesswrong.com | Nina Rimsky | Sarah Ball

    This work was performed as part of SPAR. We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may. Our analysis includes a wide range of jailbreaks such as harmful prompts developed in Wei et al. (2024), the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023).

  • Jun 9, 2024 | lesswrong.com | Nina Rimsky

    I’m a big fan of the Soviet comedy directors Eldar Ryazanov, Leonid Gaidai, and Georgiy Daneliya.

  • Sep 15, 2023 | lesswrong.com | Nina Rimsky

    I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering "retraining" from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.

  • Aug 28, 2023 | lesswrong.com | Rob Bensinger | Paul Crowley | Nina Rimsky

    The Ideological Turing Test is an exercise where you try to pretend to hold an opposing ideology convincingly enough that outside observers can't reliably distinguish you from a true believer. It was first described by economist Bryan Caplan: "Put me and five random liberal social science Ph.D.s in a chat room. Let liberal readers ask questions for an hour, then vote on who isn't really a liberal. Then put [economist Paul] Krugman and five random libertarian social science Ph.D.s in a chat room."

  • Jul 28, 2023 | lesswrong.com | Nina Rimsky | Ethan Perez | Sam Bowman | Logan Riggs

    Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions.
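The jailbreak and sycophancy excerpts above both rely on contrastive activation steering: take paired prompts that do and do not exhibit a behaviour, subtract their residual-stream activations to obtain a steering vector, and add that vector back into the model's activations during generation. The sketch below is a minimal, hypothetical illustration of this idea in PyTorch/transformers, not the authors' actual code; the model ("gpt2" as a small stand-in), the layer index, the contrastive prompts, and the scaling coefficient are all illustrative assumptions.

```python
# Hypothetical sketch of contrastive activation steering, not the authors' released code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the cited work uses larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # illustrative residual-stream layer (hidden_states[LAYER] = output of block LAYER-1)

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Return the layer-LAYER residual-stream activation at the final token (a simplification)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive pair: one completion exhibiting the target behaviour, one not (toy examples).
pos = "Question: Is the user always right? Answer: Yes, absolutely, you are right."
neg = "Question: Is the user always right? Answer: No, the evidence says otherwise."
steering_vector = residual_at_layer(pos) - residual_at_layer(neg)

def add_steering(module, inputs, output, alpha=4.0):
    """Forward hook: add the scaled steering vector to the block's output at every position."""
    if isinstance(output, tuple):
        return (output[0] + alpha * steering_vector,) + output[1:]
    return output + alpha * steering_vector

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
try:
    prompt = "Question: Was the earth flat in ancient times? Answer:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so the model behaves normally afterwards
```

Negating the coefficient steers against the behaviour rather than toward it. For the jailbreak comparison, vectors extracted from different jailbreak types could then be compared (e.g. by cosine similarity) to ask whether they point in similar directions, which is one way to probe whether the jailbreaks share an internal mechanism.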

