
Mike Vaiana
Articles
-
Jul 30, 2024 | lesswrong.com | Steve Byrnes | Marc Carauleanu | Mike Vaiana | Gunnar Zarncke
Many thanks to Bogdan-Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. Summary: In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others, while preserving performance.
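As a rough illustration of that objective (a sketch under assumptions, not the implementation from the post): the training loss can combine the original task loss with a term that pulls the model's internal activations on matched self- and other-referencing inputs toward each other. All names below (`encode`, `self_batch`, `other_batch`, `alpha`) are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a self-other overlap loss term, not the authors' code.
# `encode` stands for whatever map extracts the model's internal activations;
# `self_batch` / `other_batch` are matched inputs where the model reasons
# about itself vs. about another agent.

def self_other_overlap_loss(encode, self_batch, other_batch, task_loss, alpha=0.5):
    h_self = encode(self_batch)    # internal activations on "self" prompts
    h_other = encode(other_batch)  # activations on matched "other" prompts
    # Overlap term: penalize the distance between paired representations.
    overlap = F.mse_loss(h_self, h_other)
    # Keep the original task loss in the objective to preserve performance.
    return task_loss + alpha * overlap

# Toy usage: a linear layer and random tensors stand in for a real model.
encode = torch.nn.Linear(16, 8)
self_batch, other_batch = torch.randn(4, 16), torch.randn(4, 16)
loss = self_other_overlap_loss(encode, self_batch, other_batch,
                               task_loss=torch.tensor(1.0))
loss.backward()
```

The weight `alpha` governs the trade-off: too low and representations stay distinct, too high and the overlap term can degrade task performance.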
-
Jul 10, 2024 | lesswrong.com | Mike Vaiana
Many thanks to Diogo de Lucena, Cameron Berg, Judd Rosenblatt, and Philip Gubbins for support and feedback on this post. TL;DR: Reinforcement Learning from Human Feedback (RLHF) is one of the leading methods for fine-tuning foundational models to be helpful, harmless, and honest. But it's complicated, and the standard implementation requires a pool of crowdsourced workers to provide feedback.
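To make that "standard implementation" concrete: the usual recipe trains a reward model on pairwise preference labels collected from those workers, then optimizes the policy against it with RL. A minimal sketch of the reward-model loss (the standard Bradley-Terry pairwise objective, with hypothetical tensor names, not the post's code) looks like this:

```python
import torch
import torch.nn.functional as F

# Bradley-Terry pairwise loss commonly used to train RLHF reward models:
# maximize the log-probability that the response a worker preferred
# scores higher than the response they rejected.

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random scalars stand in for a reward model's scores on a
# batch of eight (chosen, rejected) response pairs.
chosen, rejected = torch.randn(8), torch.randn(8)
print(preference_loss(chosen, rejected))
```

Each (chosen, rejected) pair here corresponds to one human comparison, which is why the standard pipeline depends on a pool of labelers.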