
Nina Rimsky


Articles

  • Jun 20, 2024 | lesswrong.com | Nina Rimsky | Sarah Ball

    This work was performed as part of SPAR. We use activation steering (Turner et al., 2023; Rimsky et al., 2023) to investigate whether different types of jailbreaks operate via similar internal mechanisms. We find preliminary evidence that they may. Our analysis includes a wide range of jailbreaks such as harmful prompts developed in Wei et al. (2024), the universal jailbreak in Zou et al. (2023b), and the payload split jailbreak in Kang et al. (2023).

  • Jun 9, 2024 | lesswrong.com | Nina Rimsky

    I’m a big fan of the Soviet comedy directors Eldar Ryazanov, Leonid Gaidai, and Georgiy Daneliya.

  • Sep 15, 2023 | lesswrong.com | Nina Rimsky

    I agree that approximating the PBO makes this method more lossy (not all interesting generalization phenomena can be found). However, I think we can still glean useful information about generalization by considering "retraining" from a point closer to the final model than random initialization. The downside is if, for example, some data was instrumental in causing a phase transition at some point in training, this will not be captured by the PBO approximation.

  • Aug 28, 2023 | lesswrong.com | Rob Bensinger | Paul Crowley | Nina Rimsky

    The Ideological Turing Test is an exercise where you try to pretend to hold an opposing ideology convincingly enough that outside observers can't reliably distinguish you from a true believer. It was first described by economist Bryan Caplan: "Put me and five random liberal social science Ph.D.s in a chat room. Let liberal readers ask questions for an hour, then vote on who isn't really a liberal. Then put [economist Paul] Krugman and five random libertarian social science Ph.D.s in a chat room."

  • Jul 28, 2023 | lesswrong.com | Nina Rimsky | Ethan Perez | Sam Bowman | Logan Riggs

    Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. I generate an activation steering vector using Anthropic's sycophancy dataset and then find that this can be used to increase or reduce performance on TruthfulQA, indicating a common direction between sycophancy on questions of opinion and untruthfulness on questions relating to common misconceptions.
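The jailbreak and sycophancy excerpts above both rely on contrastive activation steering: take paired prompts that do and do not exhibit a behaviour, subtract their residual-stream activations to obtain a steering vector, and add that vector back into the model's activations during generation. The sketch below is a minimal, hypothetical illustration of this idea in PyTorch/transformers, not the authors' actual code; the model ("gpt2" as a small stand-in), the layer index, the contrastive prompts, and the scaling coefficient are all illustrative assumptions.

```python
# Hypothetical sketch of contrastive activation steering, not the authors' released code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the cited work uses larger chat models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

LAYER = 6  # illustrative residual-stream layer (hidden_states[LAYER] = output of block LAYER-1)

def residual_at_layer(prompt: str) -> torch.Tensor:
    """Return the layer-LAYER residual-stream activation at the final token (a simplification)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Contrastive pair: one completion exhibiting the target behaviour, one not (toy examples).
pos = "Question: Is the user always right? Answer: Yes, absolutely, you are right."
neg = "Question: Is the user always right? Answer: No, the evidence says otherwise."
steering_vector = residual_at_layer(pos) - residual_at_layer(neg)

def add_steering(module, inputs, output, alpha=4.0):
    """Forward hook: add the scaled steering vector to the block's output at every position."""
    if isinstance(output, tuple):
        return (output[0] + alpha * steering_vector,) + output[1:]
    return output + alpha * steering_vector

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
try:
    prompt = "Question: Was the earth flat in ancient times? Answer:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(gen[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so the model behaves normally afterwards
```

Negating the coefficient steers against the behaviour rather than toward it. For the jailbreak comparison, vectors extracted from different jailbreak types could then be compared (e.g. by cosine similarity) to ask whether they point in similar directions, which is one way to probe whether the jailbreaks share an internal mechanism.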

